A package to evaluate structural models using chemical crosslinking distance constraints.

Institute of Chemistry
University of Campinas


Tools to analyse TopoLink results

TopoLink comes with some helper programs that are used to analyze the results in large sets of structural models. These packages are available upon installation of TopoLink, and are:

1. evalmodels: A package that reads sets of TopoLink output files and model quality scores from another source, and writes tables for plotting their correlation.
2. linkcorrelation: A package to analyze the correlation between crosslinks in ensembles of models.
3. linkensemble: A package to read TopoLink output data and compute the set of models required to satisfy the observed links.
Appendix: Evaluating structural models with LovoAlign.

These tools will be deprecated in favor of the Julia implementation of the analysis suite, which is already functional and available at:


1. evalmodels

evalmodels simply reads a set of TopoLink output files, together with another file containing a model evaluation score, and writes the list of models with the crosslinking statistics associated with this score.

For example, the plot below was obtained from the output of evalmodels:

The values on the y-axis, i.e. the number of observed links that are consistent with each structure, were obtained from the TopoLink log files of each model. The values on the x-axis, in this case the similarity of each model to the crystallographic structure, were obtained by aligning each model with the crystallographic model using LovoAlign (see the Appendix for details).

evalmodels is executed as follows:
evalmodels loglist.txt scores.dat output.dat -m1 -c2
where loglist.txt is a list of TopoLink log files, in the following form:
scores.dat is a table containing the name of the models (or model files) and the third-party score that will be used, for example:
model1   0.754
model2   0.321
model3   0.135
and, finally, the -m1 and -c2 flags indicate the columns in scores.dat containing the name of the model and the value of the score, respectively (in the example, 1 and 2). The model names may be file names; only the base name is considered (i.e. "model1"), and it must coincide with the base name of the corresponding TopoLink log file ("model1.log").
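
The base-name matching described above can be illustrated with a short Python sketch. This is not the evalmodels source: the function names `base_name`, `read_scores` and `match_logs`, the `logs/` paths, and the skipping of `#` comment lines are assumptions made for illustration only.

```python
import os


def base_name(path):
    """Return the file name without its directory and without its last extension."""
    return os.path.splitext(os.path.basename(path))[0]


def read_scores(lines, model_col=1, score_col=2):
    """Parse score-table lines; columns are 1-based, like the -m and -c flags."""
    scores = {}
    for line in lines:
        line = line.strip()
        # Assumption: blank lines and lines starting with '#' are ignored.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        scores[base_name(fields[model_col - 1])] = float(fields[score_col - 1])
    return scores


def match_logs(log_paths, scores):
    """Pair each log file with the score of the model sharing its base name."""
    return {p: scores[base_name(p)] for p in log_paths if base_name(p) in scores}


score_lines = ["model1   0.754", "model2   0.321", "model3   0.135"]
logs = ["logs/model1.log", "logs/model3.log"]  # hypothetical paths
scores = read_scores(score_lines, model_col=1, score_col=2)
paired = match_logs(logs, scores)
print(paired)
```

With this convention, a score entry written as a file name (for example "model1.pdb") and a log file "model1.log" both reduce to "model1" and are therefore paired.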

The output file, output.dat, has the following structure:
# TopoLink
#
# EvalModels output file.
#
# Log file list: log.list
# Score (possibly LovoAlign log) file: ../analysis/cristal.log
# Number of models 11001
#
# Score: Model quality score, obtained from column 8 of the score file.
#
# RESULT0: Number of consistent observations.
# RESULT1: Number of topological distances consistent with all observations.
# RESULT2: Number of topological distances NOT consistent with observations.
# RESULT3: Number of missing links in observations.
# RESULT4: Number of distances with min and max bounds that are consistent.
# RESULT5: Sum of the scores of observed links in all observations.
# RESULT6: Likelihood of the structural model, based on observations.
#
# More details at:
#
# Score RESULT0 RESULT1 RESULT2 RESULT3 RESULT4 RESULT5 RESULT6 MODEL
69.59000 16 16 10 10 18 0.00000 0.10000E+01 S_00093408
65.48500 16 16 10 11 21 0.00000 0.10000E+01 S_00090481
63.06000 17 17 9 23 0 0.00000 0.99996E+00 S_00108183
...
The first column is the score read from the scores.dat file. The other columns contain the crosslink statistics of each model, as described in the file header, and can be plotted against the score of the first column using any plotting software.
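
As an illustration of how such a table might be prepared for plotting, the following Python sketch extracts (Score, RESULT0) pairs from lines in the format shown above. The function name `parse_evalmodels` is hypothetical, and the parser assumes that every data line has exactly the nine whitespace-separated columns of the example.

```python
def parse_evalmodels(lines):
    """Parse evalmodels-style output lines into a list of dictionaries."""
    rows = []
    for line in lines:
        line = line.strip()
        # Skip header comments, blank lines and the "..." elision marker.
        if not line or line.startswith("#") or line == "...":
            continue
        fields = line.split()
        rows.append({
            "score": float(fields[0]),
            "results": [float(x) for x in fields[1:8]],  # RESULT0..RESULT6
            "model": fields[8],
        })
    return rows


sample = [
    "# Score RESULT0 RESULT1 RESULT2 RESULT3 RESULT4 RESULT5 RESULT6 MODEL",
    "69.59000 16 16 10 10 18 0.00000 0.10000E+01 S_00093408",
    "65.48500 16 16 10 11 21 0.00000 0.10000E+01 S_00090481",
    "63.06000 17 17 9 23 0 0.00000 0.99996E+00 S_00108183",
]
rows = parse_evalmodels(sample)
# x = model quality score, y = number of consistent observations (RESULT0)
xy = [(r["score"], r["results"][0]) for r in rows]
print(xy)
```

The resulting list of pairs can be passed to any plotting tool to produce a scatter plot of the score against the number of consistent observations.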

2. linkcorrelation

linkcorrelation is a package to compute the correlation between the satisfaction of links in ensembles of structural models. It produces, as output, a matrix of correlations, containing one of the following:
1. The fraction of structures that satisfy both links simultaneously.
2. The fraction of structures that do not satisfy both links simultaneously.
3. The fraction of structures that satisfy one link or the other, but not both.
4. The correlation of the crosslink pair, that is, a score in the interval [-1,1] which is -1 if the links are anti-correlated, and 1 if they are correlated.
Running linkcorrelation: The package must be run with:
linkcorrelation loglist.txt -type [type]
where loglist.txt is a file containing a list of all TopoLink log files to be considered, and [type] is an integer number with value 1 to 4, according to the desired type of output, as described above.
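
To make the four output types concrete, here is a minimal Python sketch that computes each matrix from a table stating which model satisfies which link. It is not the linkcorrelation implementation. Two choices are assumptions: type 2 is interpreted as "neither link is satisfied", and type 4 uses the Pearson (phi) coefficient of the two binary variables, since the exact formula used by the package is not stated here. The function name `link_matrix` and the toy data are hypothetical.

```python
from math import sqrt


def link_matrix(satisfied, kind):
    """satisfied[m][i] is truthy if model m satisfies link i; kind is 1 to 4."""
    n_models = len(satisfied)
    n_links = len(satisfied[0])
    matrix = [[0.0] * n_links for _ in range(n_links)]
    for i in range(n_links):
        a = [bool(row[i]) for row in satisfied]
        for j in range(n_links):
            b = [bool(row[j]) for row in satisfied]
            both = sum(x and y for x, y in zip(a, b)) / n_models
            if kind == 1:
                value = both
            elif kind == 2:
                # Assumed interpretation: neither link is satisfied.
                value = sum(not x and not y for x, y in zip(a, b)) / n_models
            elif kind == 3:
                # Exactly one of the two links is satisfied.
                value = sum(x != y for x, y in zip(a, b)) / n_models
            elif kind == 4:
                # Assumed formula: phi (Pearson) coefficient, in [-1, 1].
                pa = sum(a) / n_models
                pb = sum(b) / n_models
                denom = sqrt(pa * (1 - pa) * pb * (1 - pb))
                value = (both - pa * pb) / denom if denom > 0 else 0.0
            else:
                raise ValueError("kind must be an integer from 1 to 4")
            matrix[i][j] = value
    return matrix


# Toy data: 4 models (rows) and 3 links (columns).
toy = [
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
]
both = link_matrix(toy, 1)
exclusive = link_matrix(toy, 3)
corr = link_matrix(toy, 4)
```

In the toy data, the second and third links are never satisfied by the same model, so their type 1 entry is zero, their type 3 entry is one, and their correlation under the assumed formula is -1.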

The loglist.txt file must be of the form:
For example, these two correlation plots were produced with the output of linkcorrelation:


Plot A, on the left, was generated with the "-type 1" option, and shows the fraction of structures of the set that satisfy both crosslinks of each matrix element at the same time. In particular, the diagonal contains the fraction of structures that satisfy each specific crosslink.

Plot B, on the right, was generated with the "-type 3" option, and shows the fraction of structures that satisfy one link OR the other, exclusively (if both links are satisfied, the model is not counted). This plot shows anti-correlations between the links. The diagonal, in this case, is null, because each crosslink is completely positively correlated with itself.

These plots were generated from the output of linkcorrelation using a python/matplotlib script.

3. linkensemble

linkensemble computes the minimum and optimal sets of models required to satisfy the observed links. For example, if one has observed 26 experimental crosslinks, it is quite typical that no single model accounts for all 26 crosslinks at the same time. Therefore, one wishes to find a set of models, representing some conformational variability, that accounts for all, or at least most, of the observed links.

linkensemble depends on some quality measure for the models, provided in a file of the following form:
# Comments
100.00  S_00093408.pdb
18.704  S_00000001.pdb
33.889  S_00000002.pdb
Let's call this file scores.dat. The quality score might be some modeling score, the G-score (the output of G-score is already provided in the correct format), or a measure of similarity to a reference model, for example.

With the scores.dat file in hand, linkensemble is run with:
linkensemble loglist.txt scores.dat linkensemble.dat
This will produce a file with the following form:
# TopoLink
#
# LinkEnsemble output file.
#
# Log file list: ../log.list
# G-score file: S_00093408_align.dat
# Number of models 11001
# Number of observed crosslinks: 26
#
# 1 MET A 1 LYS A 17
# 2 MET A 1 LYS A 113
# 3 LYS A 6 GLU A 9
...
# 26 LYS A 113 SER A 116
#
# Nmodel: Number of crosslinks satisfied by this model.
# RelatP: Relative probability of this model (G-score ratio to best model).
# DeltaG: RelatP converted to DeltaG (kcal/mol).
# Ntot: Total number of links satisfied by the ensemble up to this model.
# Next: link indexes according to list above.
#
# Model Nmodel RelatP DeltaG Ntot 1 2 3 4 ... 26
1 S_00093408 16 1.00000 -0.00000 16 0 1 1 1 ... 0
2 S_00037416 20 0.68889 0.22067 20 1 1 1 1 ... 1
3 S_00060737 17 0.65370 0.25172 21 1 1 1 1 ... 1
...
This file contains a list of the observed links, and a list of the models, with the following data: the index of the model, ordered from highest to lowest score; the name of the model; the number of links satisfied by the model; the score of each model relative to the best model (called RelatP because the suggested G-score is a probability); the corresponding ΔG, if the score is a probability; the total number of links cumulatively satisfied by the set of models up to that model; and the list of links satisfied or not (1 or 0) by the model.
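
The conversion from RelatP to DeltaG is not spelled out here, but the sample values are numerically consistent with ΔG = -RT ln(RelatP) using R = 0.001987 kcal/(mol K) and T = 298 K. The Python sketch below works under that assumption, which is inferred from the example output and not confirmed by the package documentation. The names `delta_g`, `R_KCAL` and `TEMPERATURE` are hypothetical.

```python
from math import log

R_KCAL = 0.001987    # gas constant in kcal/(mol K); assumed value
TEMPERATURE = 298.0  # temperature in K; assumed value


def delta_g(relat_p, temperature=TEMPERATURE):
    """Convert a relative probability into a free energy difference (kcal/mol)."""
    return -R_KCAL * temperature * log(relat_p)


for p in (1.0, 0.68889, 0.65370):
    print(f"{p:.5f} {delta_g(p):.5f}")
```

Under these assumed constants, the RelatP values 0.68889 and 0.65370 from the sample output map to the DeltaG values 0.22067 and 0.25172 listed next to them, to within rounding.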

For example, the output above indicates that the first model satisfies 16 links, the second satisfies 20 links, and the third satisfies 17 links (third column). If the three models are taken into consideration, 21 links are satisfied. The list of links follows each model.
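
The cumulative count (Ntot) can be sketched in Python as follows. This is an illustration and not the linkensemble source: models are ranked by decreasing score, and the set of links covered so far is extended with each model. The function name `cumulative_coverage` and the toy models are hypothetical.

```python
def cumulative_coverage(models):
    """models: list of (name, score, set of satisfied link indexes).

    Returns (name, Nmodel, Ntot) tuples, ordered from highest to lowest score.
    """
    ranked = sorted(models, key=lambda m: m[1], reverse=True)
    covered = set()
    table = []
    for name, score, links in ranked:
        covered |= links  # union with the links satisfied by earlier models
        table.append((name, len(links), len(covered)))
    return table


# Toy data: three hypothetical models and five hypothetical links.
toy = [
    ("model_b", 0.69, {1, 2, 3, 4}),
    ("model_a", 1.00, {2, 3}),
    ("model_c", 0.65, {3, 4, 5}),
]
table = cumulative_coverage(toy)
for name, n_model, n_tot in table:
    print(name, n_model, n_tot)
```

As in the output discussed above, Ntot can only grow along the list: a model increases it only when it satisfies at least one link that no better-scored model satisfies.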