Configuration files
In addition to a basic command-line interface, most of the scripts that are part of RNA-clique can be controlled via YAML configuration files. Configuration files can be useful for making analyses more reproducible or for performing multiple analyses with similar parameters. Many RNA-clique programs can also output configuration files to store the parameters of the analysis performed in a machine-readable format that can be readily used to repeat the analysis or as a template for further analyses with similar parameters.
This document first provides a basic template configuration file with default or empty values for all settings. It then describes how configuration settings may be derived automatically from others and how configuration files interact with the command-line interface, and it discusses optional values and how file paths are interpreted. Finally, it provides a reference for the settings that can be provided in RNA-clique configuration files and explains how to troubleshoot RNA-clique's configuration.
Template
# Version of the configuration schema used.
config_version: 0.0.1
# Name to assign to the analysis.
title:
# Directories containing the transcript FASTA files.
input_dirs:
# Output directory containing top n genes by coverage.
top_genes_dir:
# Output directory containing gene matches tables.
tables_dir:
# Intermediate directory containing BLAST DB caches.
cache_dir:
# Output directory root.
output_dir:
# Output gene matches graph.
graph:
# Number of top genes by k-mer coverage to select.
top_genes:
# Name of transcripts files in input directories.
transcripts_name: transcripts.fasta
# Threshold for counting a match between two genes.
top_matches: 1
# e-value threshold to use for BLASTn searches.
evalue: 1e-99
# Keep all matches between genes in the case of ties.
keep_all: true
# Number of parallel jobs to use.
jobs: 31
# Python regex to use for parsing transcript IDs.
transcript_id_regex: ^.*cov_([0-9]+(?:\.[0-9]+))_g([0-9]+)_i([0-9]+)
# Mapping from paths to sample names.
path_to_sample:
# Output distance matrix location.
matrix:
# When the last analysis associated with this config file finished.
finished:
# Version of RNA-clique used to create this analysis.
version: v0.2.0
# Path to analysis of which this is a subset.
subset_of:
# Print more output than usual.
verbose: false
Derived settings
In some cases, RNA-clique derives default values for settings automatically from the values of other settings. These defaults are used when the user does not specify a value for the setting explicitly. Derived defaults simplify RNA-clique commands for common use cases by allowing users to provide fewer command-line arguments. The flexibility gained from having many command-line arguments is retained, because any setting that can be derived can still be specified explicitly.
For example, RNA-clique will automatically derive values for top_genes_dir,
cache_dir, tables_dir, graph, matrix, and output_config from
output_dir, which means only a root output_dir value must be provided. Any
of the derived default values for these settings can be overridden, however,
when the setting is defined explicitly by the user.
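The idea can be sketched in a few lines of Python. This is only an illustration, not RNA-clique's implementation: the function name `derive_from_output_dir` and the relative names in `DERIVED_FROM_OUTPUT_DIR` (such as `top_genes` or `graph.pkl`) are assumptions made for the example. The actual rules are those documented in the Command-line usage guide.

```python
from pathlib import Path

# Hypothetical derivation rules: each derived setting maps to a path
# relative to output_dir. The relative names are illustrative only.
DERIVED_FROM_OUTPUT_DIR = {
    "top_genes_dir": "top_genes",
    "cache_dir": "cache",
    "tables_dir": "tables",
    "graph": "graph.pkl",
    "matrix": "distance_matrix.pkl",
    "output_config": "config.yaml",
}


def derive_from_output_dir(settings):
    """Return a copy of settings with missing derived values filled in."""
    result = dict(settings)
    output_dir = result.get("output_dir")
    if output_dir is None:
        return result  # Nothing to derive from.
    for name, relative in DERIVED_FROM_OUTPUT_DIR.items():
        # Explicit values always win over derived defaults.
        if result.get(name) is None:
            result[name] = Path(output_dir) / relative
    return result


settings = derive_from_output_dir(
    {"output_dir": Path("out"), "graph": Path("elsewhere/graph.pkl")}
)
print(settings["tables_dir"])  # derived from output_dir
print(settings["graph"])       # explicit value is kept
```

Here only `output_dir` and `graph` are given. Every other output location is filled in under `out`, while the explicit `graph` value is left untouched.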
How settings can be derived from others depends on the script being used, though many scripts share similar rules. These rules are documented for each script in the Command-line usage guide. When a setting can be derived from another for some script, the rule is described in the "default value" column of the row for the derived setting in the table under the script's "Options" subsection. In most cases, the rule is described in a kind of succinct pseudocode, but when the rule cannot be expressed so succinctly, the table simply documents the settings on which the derived setting depends. Rarely, there may be multiple possible rules for deriving a value. In such cases, the table simply lists the default value as "dynamic," and the rules are explained in more detail beneath the table.
Interaction with command-line arguments
RNA-clique can be controlled via command-line arguments and configuration files
simultaneously by providing both command-line arguments and a configuration file
via the --input-config/-C option. Using settings from both sources can make
RNA-clique more convenient to use; for example, it makes it possible to put
common settings in a configuration file and make further adjustments to settings
without requiring that the user open an editor to change the configuration file
each time the user wants to try a different setting. Unfortunately, accepting
settings from multiple sources inevitably raises questions about how those
sources should interact, and the problem is exacerbated somewhat when certain
settings can be derived automatically from other settings, as they are in many
of RNA-clique's programs. This section briefly discusses rules for using
command-line arguments and configuration files simultaneously. Of course, a user
can always opt to use only one source at a time should they prefer to avoid the
complexity of the interactions described here.
Currently, RNA-clique first processes the arguments provided directly on the command line. Where possible, any rules for deriving settings not provided explicitly are applied until no further derivations are possible.
After the command-line arguments have been processed, RNA-clique loads the settings provided in the configuration file (if specified). Settings specified explicitly in the configuration file do not override any settings specified explicitly at the command line or derived from the command-line settings. After the explicit configuration file settings have been loaded, the rules for deriving settings not provided explicitly are again applied to derive missing settings.
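The order of operations described above can be sketched as follows. This is a simplified model under stated assumptions, not RNA-clique's code: settings are plain dictionaries, each hypothetical rule in `RULES` derives one setting from one other setting, and the names `derive` and `effective_settings` are invented for the example.

```python
def derive(settings, rules):
    """Apply derivation rules repeatedly until no new setting appears."""
    result = dict(settings)
    changed = True
    while changed:
        changed = False
        for name, (dependency, rule) in rules.items():
            if name not in result and dependency in result:
                result[name] = rule(result[dependency])
                changed = True
    return result


def effective_settings(cli_args, config_file, rules):
    # Step 1: start from the command line and derive what can be derived.
    result = derive(cli_args, rules)
    # Step 2: config-file values only fill settings that are still missing.
    for name, value in config_file.items():
        result.setdefault(name, value)
    # Step 3: derive again, now that the config file may have added inputs.
    return derive(result, rules)


# Hypothetical rules: both settings are derived from output_dir.
RULES = {
    "tables_dir": ("output_dir", lambda d: f"{d}/tables"),
    "graph": ("output_dir", lambda d: f"{d}/graph"),
}

cli = {"output_dir": "cli_out"}
config = {"output_dir": "cfg_out", "tables_dir": "cfg_tables", "jobs": 4}
merged = effective_settings(cli, config, RULES)
print(merged)
```

In this model, `tables_dir` ends up as `cli_out/tables` rather than `cfg_tables`: the value derived from the command-line `output_dir` already exists when the configuration file is loaded, so the explicit configuration-file value does not replace it. The `jobs` value, which the command line did not provide or derive, is taken from the configuration file.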
Since the rules above are slightly complex, it is recommended that users who
combine configuration files and command-line arguments by hand
take advantage of the --show-config option
to verify that RNA-clique is using the expected settings.
Optional values
Configuration files provided to RNA-clique programs need not provide values for all possible settings. Values missing from the configuration can be provided via command-line arguments or derived from the provided values automatically, and an RNA-clique program will not fail if settings it does not need are missing. The Command-line usage guide documents which settings must be present (from any source) in the "Required" column of each script's "Options" table.
File paths
When a setting is assigned a relative file path, the file path is interpreted as
relative to the working directory of the running script rather than the
directory of the configuration file or the analysis root. This behavior ensures
that relative paths are treated the same whether they are provided via a
configuration file or via a command-line argument, but it can cause some
confusion when reusing a config file from a different directory. Future versions
of the RNA-clique configuration schema may introduce a working_directory
setting to document where the analysis was executed.
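The consequence of this rule can be illustrated with a short sketch. The helper `resolve_setting_path` and the directories used here are hypothetical. The example only models the behavior described above, in which a relative path is joined to the working directory and the location of the configuration file plays no role.

```python
from pathlib import PurePosixPath


def resolve_setting_path(value, working_dir):
    """Interpret a path setting relative to the working directory."""
    path = PurePosixPath(value)
    if path.is_absolute():
        return path
    return PurePosixPath(working_dir) / path


value = "out/tables"  # relative path stored in a configuration file

# Running from the directory where the analysis was first executed
# finds the expected location...
from_project = resolve_setting_path(value, "/projects/fescue")
# ...but running from another directory with the same configuration
# file points somewhere else.
from_home = resolve_setting_path(value, "/home/user")

print(from_project)
print(from_home)
```

When a configuration file will be reused from different directories, absolute paths avoid this ambiguity.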
Settings
| Setting | Python type | YAML type | Description |
|---|---|---|---|
| config_version | str | Scalar | Version of the configuration schema used. |
| title | str | Scalar | Name to assign to the analysis. |
| input_dirs | list[pathlib.Path] | Sequence of Scalar | Directories containing the transcript FASTA files. |
| top_genes_dir | pathlib.Path | Scalar | Directory containing top n genes by coverage. |
| tables_dir | pathlib.Path | Scalar | Directory containing gene matches tables. |
| cache_dir | pathlib.Path | Scalar | Directory containing BLAST DB caches. |
| output_dir | pathlib.Path | Scalar | RNA-clique analysis output root directory. |
| graph | pathlib.Path | Scalar | Gene matches graph. |
| top_genes | int | Scalar | Number of top genes by k-mer coverage to select. |
| transcripts_name | str | Scalar | Name of transcripts files in input directories. |
| top_matches | int | Scalar | Threshold for counting a match between two genes. |
| evalue | float | Scalar | e-value threshold to use for BLASTn searches. |
| keep_all | bool | Scalar | Keep all matches between genes in the case of ties. |
| jobs | int | Scalar | Number of parallel jobs to use. |
| transcript_id_regex | re.Pattern | Scalar | Python regex to use for parsing transcript IDs. |
| path_to_sample | dict[pathlib.Path, str] | Mapping from Scalar to Scalar | Mapping from paths to sample names. |
| matrix | pathlib.Path | Scalar | Output distance matrix location. |
| finished | datetime.datetime | Scalar | When the last analysis associated with this config file finished. |
| version | str | Scalar | Version of RNA-clique used to create this analysis. |
| subset_of | pathlib.Path | Scalar | Path to analysis of which this is a subset. |
| verbose | bool | Scalar | Print more output than usual. (Useful for debugging.) |
config_version
RNA-clique configuration files can state what version of the configuration
schema they use. This setting is designed to provide a measure of
future-proofing in case eventual changes to the schema require explicit
detection of the schema version to provide backwards compatibility. Currently,
the schema version is 0.0.2, but no programs in RNA-clique use the schema
version.
input_dirs
The input_dirs setting should be a list (YAML Sequence) of file paths to
directories containing the transcriptome files.
top_genes_dir
The top_genes_dir setting should be a path to a
directory containing the top genes file.
tables_dir
The tables_dir setting should be a path to a directory containing the gene
matches tables.
cache_dir
The cache_dir is a directory in which RNA-clique will store a simple_blast
BLAST database cache containing BLAST databases for the top genes files.
cache_dir is an intermediate output directory not intended to be used by the
RNA-clique user.
output_dir
output_dir, when provided, is used as the default "root" for the RNA-clique
analysis. Paths for top_genes_dir, cache_dir, tables_dir, graph,
matrix, and output_config will usually be automatically
placed under the output_dir when those settings are
not provided explicitly. (Check the relevant "Options" section for the script in
the Command-line usage guide to verify this behavior for the script
being used.)
graph
The graph setting should be a path to the gene matches
graph.
top_genes
top_genes is the number of top genes to select by \(k\)-mer coverage when
creating the top genes files. In the original RNA-clique
paper, top_genes is referred to as parameter \(n\).
\(k\)-mer coverage quantifies the amount of sequence data that contributes to an assembled transcript. Some assemblers, including rnaSPAdes, report \(k\)-mer coverage values for each assembled transcript. When an assembler can report multiple isoforms of a single transcript, transcripts can be grouped into "isotig sets" representing isoforms of the same gene. (rnaSPAdes does this grouping automatically, but a separate program could also be used to organize transcripts into isotig sets.) Although the assembler only assigns \(k\)-mer coverage values to individual transcripts, RNA-clique takes the \(k\)-mer coverage of a gene to be the maximum \(k\)-mer coverage among all transcripts that belong to the same gene (isotig set).
When RNA-clique selects the top \(n\) genes by \(k\)-mer coverage, it is intended to select those genes best supported by the RNA-seq data. This step is intended both to reduce errors caused by poorly assembled transcripts and to reduce the amount of time needed to perform the remainder of the analysis. What setting of \(n\) is best depends on the data, but \(n = 50000\) worked well for the analyses performed in the original RNA-clique paper.
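The selection rule can be sketched in Python as follows. This is a minimal illustration, not RNA-clique's implementation: the function name `top_genes_by_coverage`, the tuple layout of the input, and the handling of ties at the cutoff (here, whatever order a stable sort produces) are all assumptions.

```python
def top_genes_by_coverage(transcripts, n):
    """Select the n genes with the highest k-mer coverage.

    transcripts: iterable of (gene_id, isoform_id, coverage) tuples.
    """
    gene_coverage = {}
    for gene, _isoform, coverage in transcripts:
        # A gene's coverage is the maximum over its transcripts.
        gene_coverage[gene] = max(coverage, gene_coverage.get(gene, coverage))
    ranked = sorted(gene_coverage, key=gene_coverage.get, reverse=True)
    return ranked[:n]


transcripts = [
    (1, 1, 12.5), (1, 2, 40.0),   # gene 1: coverage 40.0
    (2, 1, 33.1),                 # gene 2: coverage 33.1
    (3, 1, 5.0), (3, 2, 7.25),    # gene 3: coverage 7.25
]
selected = top_genes_by_coverage(transcripts, n=2)
print(selected)  # [1, 2]
```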
transcripts_name
RNA-clique expects all input transcriptome FASTA
files to have the same filename. By default, RNA-clique expects them all to be
named transcripts.fasta since this is the default output filename for
rnaSPAdes, but this name can be customized
using the transcripts_name setting.
top_matches
When comparing sample \(A\) and sample \(B\), we BLAST \(A\) against \(B\) and \(B\)
against \(A\). Ordinarily, we keep a pair of genes \(g\) (from \(A\)) and \(h\) (from
\(B\)) when merging the results from the two directions if and only if \(g\) is
among the best matches for \(h\) in \(A\), and \(h\) is among the best matches for
\(g\) in \(B\), according to bitscore. (We allow ties, so \(h\) may not be the only
best match for \(g\) in \(B\), and, likewise, \(g\) may not be the only best match
for \(h\) in \(A\).) This behavior corresponds to a parameter setting of
top_matches = 1 (\(N = 1\) in the original RNA-clique paper) because we
consider only the matches with top \(N = 1\) bitscore in both directions.
We could alternatively set \(N\) to some value greater than 1. In that case, when we merge the two directions, we could keep a pair of genes \(g\) and \(h\) if and only if \(g\) is among the matches with top \(N\) bitscore for \(h\) in \(A\), and \(h\) is among the matches with top \(N\) bitscore for \(g\) in \(B\).
As of this writing, values for top_matches greater than \(1\) are mostly
untested, and it is recommended that this parameter simply be set to \(1\) in
practice.
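The reciprocal filter described in this section might be sketched as below. The sketch makes one interpretive assumption that the text does not settle: "top \(N\) bitscore" is read as "among the \(N\) highest distinct bitscores for that query gene," so ties are always kept. The function names and the tuple layout of the hits are hypothetical, and the code is not taken from RNA-clique.

```python
def top_n_hits(hits, n):
    """Keep, for each query gene, the hits whose bitscore is among the
    n highest distinct bitscores for that query (ties are kept)."""
    by_query = {}
    for query, subject, bitscore in hits:
        by_query.setdefault(query, []).append((subject, bitscore))
    kept = set()
    for query, subject_scores in by_query.items():
        best = sorted({score for _, score in subject_scores}, reverse=True)[:n]
        kept.update(
            (query, subject)
            for subject, score in subject_scores
            if score in best
        )
    return kept


def reciprocal_matches(a_vs_b, b_vs_a, n=1):
    """Keep (g, h) only if h is a top hit for g and g is a top hit for h."""
    forward = top_n_hits(a_vs_b, n)   # pairs (g, h)
    backward = top_n_hits(b_vs_a, n)  # pairs (h, g)
    return {(g, h) for g, h in forward if (h, g) in backward}


# Each hit is (query gene, subject gene, bitscore).
a_vs_b = [("g1", "h1", 200.0), ("g1", "h2", 200.0),
          ("g2", "h2", 150.0), ("g2", "h1", 90.0)]
b_vs_a = [("h1", "g1", 200.0), ("h2", "g2", 150.0), ("h2", "g1", 120.0)]

print(sorted(reciprocal_matches(a_vs_b, b_vs_a, n=1)))
print(sorted(reciprocal_matches(a_vs_b, b_vs_a, n=2)))
```

With `n=1`, the pair (g1, h2) is dropped even though h2 ties for the best match of g1, because g1 is not the best match of h2 in the other direction. With `n=2`, that pair is kept.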
transcript_id_regex
The transcript_id_regex setting should be a Python regular expression that can
be used to parse the FASTA sequence header lines of the transcript FASTA
files. The following capture groups are expected and can be identified by
position or by
name.
| Position | Name | Description |
|---|---|---|
| 1 | coverage | \(k\)-mer coverage, expressed as a floating-point number |
| 2 | gene | Gene ID, a non-negative integer |
| 3 | isoform | Transcript isoform ID within the gene. |
When some capture groups in a regular expression are named, and others are unnamed, the unnamed capture groups are assumed to be the remaining capture groups from the table above, in order by their position.
Any extra capture groups beyond those in the table above are ignored.
Examples
In the regular expression below, the default order of capture groups is
used. The first capture group (after the string foo, but before bar) is the
coverage. The second capture group (after the string bar, but before baz) is
the gene ID. The third and last capture group (after the string baz, but
before _) is the isoform ID.
The regular expression below is the same, but capture groups are identified explicitly.
In the regular expression below, any text after the isoform ID is captured, but RNA-clique will ignore it.
In the following regular expression, the first capture group is the gene ID. The second represents the coverage, and the third represents the isoform ID.
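The patterns below are illustrative reconstructions that fit the four descriptions above. They are not necessarily the exact expressions the examples refer to. The helper `parse_header` is likewise a hypothetical sketch of how named and unnamed capture groups could be resolved according to the rules in this section. It uses `re.search`, which is an assumption about how the pattern is applied.

```python
import re

FIELDS = ("coverage", "gene", "isoform")

# Default order: coverage, gene ID, isoform ID.
POSITIONAL = r"^foo([0-9]+(?:\.[0-9]+)?)bar([0-9]+)baz([0-9]+)_"
# Same pattern with every capture group named explicitly.
NAMED = (r"^foo(?P<coverage>[0-9]+(?:\.[0-9]+)?)"
         r"bar(?P<gene>[0-9]+)baz(?P<isoform>[0-9]+)_")
# A fourth group captures trailing text; it is ignored.
EXTRA = r"^foo([0-9]+(?:\.[0-9]+)?)bar([0-9]+)baz([0-9]+)_(.*)$"
# Gene ID first, then coverage; the unnamed third group is the isoform.
REORDERED = (r"^foo(?P<gene>[0-9]+)"
             r"bar(?P<coverage>[0-9]+(?:\.[0-9]+)?)baz([0-9]+)_")


def parse_header(pattern, header):
    """Return the coverage, gene, and isoform strings found in header."""
    match = re.search(pattern, header)
    if match is None:
        return None
    regex = match.re
    positions = {name: index for name, index in regex.groupindex.items()
                 if name in FIELDS}
    named_indexes = set(regex.groupindex.values())
    unnamed = [i for i in range(1, regex.groups + 1)
               if i not in named_indexes]
    # Unnamed groups fill the fields that were not named, in table order.
    remaining = [field for field in FIELDS if field not in positions]
    positions.update(zip(remaining, unnamed))  # extra groups are ignored
    return {field: match.group(positions[field]) for field in FIELDS}


print(parse_header(POSITIONAL, "foo12.5bar7baz2_rest"))
print(parse_header(REORDERED, "foo7bar12.5baz2_rest"))
```

Both calls yield the same fields (coverage `12.5`, gene `7`, isoform `2`), even though the second header lists the gene ID before the coverage.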
keep_all
The last step in creating a gene matches table
is selecting the top gene pair for each sample 1 gene by bitscore. When
keep_all is False, this step produces a table such that every sample 1 gene in
the table is implicitly mapped to a single best match in sample 2. If for some
sample 1 gene there are multiple gene pairs with highest bitscore, ties are
broken arbitrarily by keeping only the row that comes first in the table.
When keep_all is True, RNA-clique allows more than one gene pair to be kept
for a sample 1 gene in the case of ties.
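The difference between the two settings can be sketched as follows. This is a simplified model, not RNA-clique's code: the table is represented as a list of hypothetical `(sample 1 gene, sample 2 gene, bitscore)` rows, and `select_top_pairs` is a name invented for the example.

```python
def select_top_pairs(rows, keep_all):
    """Select the best-scoring rows for each sample 1 gene.

    rows: list of (sample1_gene, sample2_gene, bitscore) in table order.
    """
    best = {}
    for gene1, _gene2, bitscore in rows:
        best[gene1] = max(bitscore, best.get(gene1, bitscore))
    selected, seen = [], set()
    for gene1, gene2, bitscore in rows:
        if bitscore != best[gene1]:
            continue  # Not a top-scoring row for this gene.
        # Without keep_all, only the first top-scoring row survives.
        if keep_all or gene1 not in seen:
            selected.append((gene1, gene2, bitscore))
            seen.add(gene1)
    return selected


rows = [
    ("g1", "h1", 200.0),
    ("g1", "h2", 200.0),  # ties with the row above
    ("g1", "h3", 50.0),
    ("g2", "h4", 80.0),
]
print(select_top_pairs(rows, keep_all=False))
print(select_top_pairs(rows, keep_all=True))
```

With `keep_all=False`, g1 is paired only with h1, the first of the tied rows. With `keep_all=True`, both (g1, h1) and (g1, h2) are kept.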
path_to_sample
The path_to_sample setting should be a dict (YAML mapping) mapping top
genes files to the names of the samples they
represent. By default, RNA-clique assumes that each sample is named after the
directory in which its transcripts FASTA file is found. This default behavior
is reflected in output YAML files produced by RNA-clique, but the mapping can be
changed manually to assign different names to samples.
Examples
The example below uses the set of 16 tall fescue samples analyzed in the original RNA-clique paper. All top genes files here are mapped to their default sample names.
path_to_sample:
f16_rna_clique_out/od1/SRR2321383_top.fasta: SRR2321383
f16_rna_clique_out/od1/SRR2321384_top.fasta: SRR2321384
f16_rna_clique_out/od1/SRR2321385_top.fasta: SRR2321385
f16_rna_clique_out/od1/SRR2321386_top.fasta: SRR2321386
f16_rna_clique_out/od1/SRR2321387_top.fasta: SRR2321387
f16_rna_clique_out/od1/SRR2321388_top.fasta: SRR2321388
f16_rna_clique_out/od1/SRR7990321_top.fasta: SRR7990321
f16_rna_clique_out/od1/SRR7990322_top.fasta: SRR7990322
f16_rna_clique_out/od1/SRR8003736_top.fasta: SRR8003736
f16_rna_clique_out/od1/SRR8003737_top.fasta: SRR8003737
f16_rna_clique_out/od1/SRR8003753_top.fasta: SRR8003753
f16_rna_clique_out/od1/SRR8003754_top.fasta: SRR8003754
f16_rna_clique_out/od1/SRR8003755_top.fasta: SRR8003755
f16_rna_clique_out/od1/SRR8003756_top.fasta: SRR8003756
f16_rna_clique_out/od1/SRR8003761_top.fasta: SRR8003761
f16_rna_clique_out/od1/SRR8003762_top.fasta: SRR8003762
In the example below, the samples have been renamed to reflect their genotypes.
path_to_sample:
f16_rna_clique_out/od1/SRR2321383_top.fasta: CTE46_1
f16_rna_clique_out/od1/SRR2321384_top.fasta: CTE46_2
f16_rna_clique_out/od1/SRR2321385_top.fasta: CTE46_3
f16_rna_clique_out/od1/SRR2321386_top.fasta: CTE46_4
f16_rna_clique_out/od1/SRR2321387_top.fasta: CTE46_5
f16_rna_clique_out/od1/SRR2321388_top.fasta: CTE46_6
f16_rna_clique_out/od1/SRR7990321_top.fasta: FATG4_1
f16_rna_clique_out/od1/SRR7990322_top.fasta: FATG4_2
f16_rna_clique_out/od1/SRR8003736_top.fasta: NTE_1
f16_rna_clique_out/od1/SRR8003737_top.fasta: NTE_2
f16_rna_clique_out/od1/SRR8003753_top.fasta: CTE27_1
f16_rna_clique_out/od1/SRR8003754_top.fasta: CTE27_2
f16_rna_clique_out/od1/SRR8003755_top.fasta: CTE27_3
f16_rna_clique_out/od1/SRR8003756_top.fasta: CTE27_4
f16_rna_clique_out/od1/SRR8003761_top.fasta: CTE27_5
f16_rna_clique_out/od1/SRR8003762_top.fasta: CTE27_6
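The default naming rule and the renaming step shown above can be sketched in Python. The helper names and the input path `reads/SRR2321383/transcripts.fasta` are hypothetical. Only the sample and genotype names are taken from the examples above.

```python
from pathlib import PurePosixPath


def default_sample_name(transcripts_path):
    """Name a sample after the directory holding its transcripts file."""
    return PurePosixPath(transcripts_path).parent.name


def rename_samples(path_to_sample, new_names):
    """Replace sample names found in new_names; keep all others."""
    return {
        path: new_names.get(sample, sample)
        for path, sample in path_to_sample.items()
    }


print(default_sample_name("reads/SRR2321383/transcripts.fasta"))

path_to_sample = {
    "f16_rna_clique_out/od1/SRR2321383_top.fasta": "SRR2321383",
    "f16_rna_clique_out/od1/SRR7990321_top.fasta": "SRR7990321",
}
renamed = rename_samples(
    path_to_sample,
    {"SRR2321383": "CTE46_1", "SRR7990321": "FATG4_1"},
)
print(renamed)
```

The keys of the mapping are unchanged by renaming. Only the sample names on the right-hand side differ.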
matrix
The matrix setting is a path to the distance
matrix.
subset_of
When one analysis uses a subset of samples from another, reusing its top
genes and gene matches
tables, the subset_of setting should be set
in the subset configuration file to a file path to the superset configuration
file.
make_subset uses subset_of to determine which
analysis to subset, but no other programs in RNA-clique currently read and use
the subset_of parameter. Nevertheless, subset_of is useful as metadata to
keep track of the relationships between analyses.
Troubleshooting configuration
If RNA-clique doesn't seem to be working as you expected, or if you would simply
like to verify that RNA-clique is processing your command-line arguments and/or
configuration file correctly before running a long analysis, it can be helpful
to see exactly what parameters RNA-clique will use to run a script. All scripts
in RNA-clique that support configuration files provide this feature via the
--show-config option, which allows the user to see the configuration or
arguments that RNA-clique will use.
By default, --show-config prints to standard output a YAML configuration file
containing the effective settings that will be used by the program. Arguments
can also be provided along with --show-config to show the original "raw"
parsed command-line arguments (original_args), the processed command-line
arguments (args), or the processed configuration (config, the
default). Multiple such arguments can be provided to show multiple sets of
settings. For example, --show-config args config will show both the processed
command-line arguments and the processed configuration.
The format used for showing the requested sets of settings can be controlled
with the --show-config-format option. By default, YAML (yaml) is preferred
for displaying the processed configuration, but the sets of settings will
instead be displayed as a dict of Python reprs (dict) by default when
args or original_args have been requested. In addition to the yaml and
dict formats, RNA-clique supports displaying sets of settings in JSON
(json). If yaml or json has been specified and the requested sets of
settings cannot be serialized automatically in that format, RNA-clique will
produce an error message.
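One reason a set of settings may fail to serialize is that parsed arguments can hold arbitrary Python objects, such as paths. The sketch below illustrates the difference between a dict of reprs and JSON using only the standard library. The function `show_settings` is hypothetical and does not reproduce RNA-clique's implementation or its error messages.

```python
import json
from pathlib import PurePosixPath


def show_settings(settings, fmt):
    """Render settings as a dict of reprs or as a JSON string."""
    if fmt == "dict":
        # repr() is defined for every Python object, so this cannot fail.
        return {name: repr(value) for name, value in settings.items()}
    if fmt == "json":
        try:
            return json.dumps(settings, indent=2)
        except TypeError as error:
            raise ValueError(f"settings are not JSON-serializable: {error}") from error
    raise ValueError(f"unsupported format: {fmt}")


raw_args = {"jobs": 4, "output_dir": PurePosixPath("out")}

print(show_settings(raw_args, "dict"))
try:
    show_settings(raw_args, "json")
except ValueError as error:
    print(error)
```

The dict format succeeds because every value is reduced to its repr, while the JSON format fails on the path object unless it is first converted to a string.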