Calculating SMILES-based models with the SOFA software involves the following.
First of all, a special file that supplies the data on the compounds under consideration should be prepared.
This file contains a set of lines, each organized as ‘Category’ & ‘CAS number (or other identifier)’ & ‘SMILES’ & ‘Numerical value of the endpoint’, for example:
+109-66-0 CCCCC -6.1
+110-54-3 CCCCCC -5.1
-111-65-9 CCCCCCCC -5.2
+26635-64-3 CC(C)CCCCC -5.2
-124-18-5 CCCCCCCCCC -4.7
+112-40-3 CCCCCCCCCCCC -3.5
#629-59-4 CCCCCCCCCCCCCC -4.3
#110-82-7 C1CCCCC1 -5.3
+493-01-6 C1CCC2CCCCC2C1 -3.3
+493-02-7 C1CCC2CCCCC2C1 -3.5
+137-43-9 BrC1CCCC1 -4.2
-542-18-7 ClC1CCCCC1 -4.1
-108-85-0 BrC1CCCCC1 -3.4
…
In other words, each line of the input starts with one character that indicates the category: training set = ‘+’; calibration set = ‘-’; test set = ‘#’.
For example, file #122 contains data on C60 solubility.
The file may contain at most 300 lines. The maximum length of a SMILES string is 200 characters.
The name of the file should be entered in the SMILES-file box.
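The input format described above can be sketched as a small parser. This is only an illustration, not part of SOFA: the names `Record`, `parse_line` and `parse_records` are hypothetical, and the code assumes that the three fields are separated by whitespace (as in the example lines) and that the identifier itself contains no spaces.

```python
from dataclasses import dataclass
from typing import List

CATEGORIES = {"+": "training", "-": "calibration", "#": "test"}
MAX_RECORDS = 300      # limit on the number of lines stated above
MAX_SMILES_LEN = 200   # limit on the SMILES length stated above


@dataclass
class Record:
    category: str    # "training", "calibration" or "test"
    identifier: str  # CAS number or other identifier
    smiles: str
    endpoint: float


def parse_line(line: str) -> Record:
    """Parse one line such as '+109-66-0 CCCCC -6.1'."""
    fields = line.split()  # assumption: whitespace-separated fields
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    tagged_id, smiles, value = fields
    marker, identifier = tagged_id[0], tagged_id[1:]
    if marker not in CATEGORIES:
        raise ValueError(f"unknown category marker {marker!r}")
    if len(smiles) > MAX_SMILES_LEN:
        raise ValueError("SMILES string is longer than 200 characters")
    return Record(CATEGORIES[marker], identifier, smiles, float(value))


def parse_records(text: str) -> List[Record]:
    """Parse all non-empty lines and enforce the 300-line limit."""
    records = [parse_line(ln) for ln in text.splitlines() if ln.strip()]
    if len(records) > MAX_RECORDS:
        raise ValueError("file contains more than 300 lines")
    return records
```

Running such a check before loading the file into the program would catch lines that were split in two or that carry an unknown category marker.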
To obtain results, the following steps are necessary:
- click the “SMS->CW(SF)” button. Each SMILES string will be represented by a system of SMILES fragments together with their correlation weights
(the start value of each correlation weight is zero);
- click the “Optimiz CW(SF)” button. The Monte Carlo method is used to find the values of the correlation
weights that produce as large a value as possible of
RR = R2(training) + R2(calibration) - abs[R2(training) - R2(calibration)] * alpha
where R(training) and R(calibration) are the correlation coefficients (R2 is their square) between the endpoint under consideration and the descriptor, calculated as
DCW = Σ CW(SF)
(the sum runs over the SMILES fragments of the string), for the training and calibration sets, respectively;
- click ‘Save CW(SF)’ to save the file (OPT) that contains the values of the optimal correlation
weights of the SMILES fragments. The list of SMILES fragments is built hierarchically: first, four-character fragments ([O+],
[N-]) are detected; second, two-character fragments (Cl, Br); finally, all others are one-character fragments;
- click ‘Demo DCW’ to save the file (DCW) with a demonstration of the DCW calculation;
- click ‘Demo MDL’ to save the file (MDL) with the experimental and calculated values of
the endpoint, the latter obtained as E = C0 + C1 * DCW.
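The fragment hierarchy, the DCW sum and the linear model from the steps above can be sketched as follows. This is a minimal illustration and not the SOFA source: the function names are hypothetical, only the fragments named in the text ([O+], [N-], Cl, Br) are treated as multi-character, and a fragment with no stored weight is assumed to contribute zero.

```python
FOUR_CHAR = ("[O+]", "[N-]")  # four-character fragments named in the text
TWO_CHAR = ("Cl", "Br")       # two-character fragments named in the text


def split_fragments(smiles):
    """Split a SMILES string into fragments, trying longer patterns first."""
    fragments = []
    i = 0
    while i < len(smiles):
        if smiles[i:i + 4] in FOUR_CHAR:
            step = 4
        elif smiles[i:i + 2] in TWO_CHAR:
            step = 2
        else:
            step = 1  # everything else is a one-character fragment
        fragments.append(smiles[i:i + step])
        i += step
    return fragments


def dcw(smiles, weights):
    """DCW = sum of CW(SF); assumption: missing fragments count as 0."""
    return sum(weights.get(sf, 0.0) for sf in split_fragments(smiles))


def predict(smiles, weights, c0, c1):
    """Calculated endpoint E = C0 + C1 * DCW."""
    return c0 + c1 * dcw(smiles, weights)
```

For instance, `split_fragments("BrC1CCCCC1")` yields `Br` followed by the one-character fragments `C`, `1`, `C`, `C`, `C`, `C`, `C`, `1`.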
Nit is the total number of iterations for the optimization; ID for DEMO is the number of the line used for the demo
calculation of the DCW descriptor (recorded in the DCW file).
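To illustrate what the optimization step and the Nit parameter do, here is a sketch of one simple Monte Carlo style search that maximizes RR. The text does not describe the exact procedure used by the program, so the random-perturbation scheme, the `step` size, the default `alpha` and all function names are assumptions. Each data set is given as a list of (fragment list, endpoint) pairs.

```python
import random


def r_squared(xs, ys):
    """Squared Pearson correlation; 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy * sxy / (sxx * syy)


def rr(weights, training, calibration, alpha):
    """RR = R2(train) + R2(calib) - alpha * |R2(train) - R2(calib)|."""
    def r2(dataset):
        descriptors = [sum(weights[sf] for sf in frags) for frags, _ in dataset]
        return r_squared(descriptors, [y for _, y in dataset])
    r2_t, r2_c = r2(training), r2(calibration)
    return r2_t + r2_c - alpha * abs(r2_t - r2_c)


def optimize(training, calibration, alpha=0.1, nit=1000, step=0.5, seed=0):
    """Perturb one random weight per iteration; keep it if RR does not drop."""
    rng = random.Random(seed)
    names = sorted({sf for frags, _ in training + calibration for sf in frags})
    weights = {sf: 0.0 for sf in names}  # start values are zero
    best = rr(weights, training, calibration, alpha)
    for _ in range(nit):  # nit plays the role of Nit
        sf = rng.choice(names)
        old = weights[sf]
        weights[sf] = old + rng.uniform(-step, step)
        score = rr(weights, training, calibration, alpha)
        if score >= best:
            best = score
        else:
            weights[sf] = old  # reject the move
    return weights, best
```

Because the test set takes no part in RR, it stays available for an independent check of the resulting model.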
WARNING:
The split into training and test sets (or into training,
calibration, and test sets) should satisfy the following conditions:
- The range of numerical values of the endpoint should be similar
for the training, calibration (if any), and test sets;
- The number of substances in the training, calibration, and test sets should
be reasonably large, e.g., the minimum is 10 for training, 5 for calibration, and 3 for the test set.
Certainly, the more substances the better. A reasonable split into training, calibration, and test sets, in percent, is 50:30:20.
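The conditions above can be checked before a calculation with a few lines of code. The sketch below is only an illustration with hypothetical names: it flags sets that fall under the stated minimum sizes and reports the endpoint range of each set, leaving the judgment of whether the ranges are "similar" to the user.

```python
MIN_SIZE = {"training": 10, "calibration": 5, "test": 3}  # minima from the text


def check_split(endpoints_by_set):
    """Return warnings for a {set name: [endpoint values]} mapping."""
    problems = []
    for name, values in endpoints_by_set.items():
        minimum = MIN_SIZE.get(name, 0)
        if len(values) < minimum:
            problems.append(
                f"{name}: {len(values)} substances, minimum is {minimum}"
            )
    return problems


def endpoint_ranges(endpoints_by_set):
    """Return (min, max) of the endpoint for each non-empty set."""
    return {
        name: (min(values), max(values))
        for name, values in endpoints_by_set.items()
        if values
    }
```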