Manual
Building SMILES-based models with the SOFA software involves the following steps.
First, prepare an input file that supplies the data on the compounds under consideration. Each line of this file holds, in order, the category marker, the CAS number (or another identifier), the SMILES string, and the numerical value of the endpoint, for example:
+109-66-0 CCCCC -6.1
+110-54-3 CCCCCC -5.1
-111-65-9 CCCCCCCC -5.2
+26635-64-3 CC(C)CCCCC -5.2
-124-18-5 CCCCCCCCCC -4.7
+112-40-3 CCCCCCCCCCCC -3.5
#629-59-4 CCCCCCCCCCCCCC -4.3
#110-82-7 C1CCCCC1 -5.3
+493-01-6 C1CCC2CCCCC2C1 -3.3
+493-02-7 C1CCC2CCCCC2C1 -3.5
+137-43-9 BrC1CCCC1 -4.2
-542-18-7 ClC1CCCCC1 -4.1
-108-85-0 BrC1CCCCC1 -3.4

In other words, each line begins with a one-character category indicator: '+' for the training set, '-' for the calibration set, and '#' for the test set. For example, file #122 contains data on C60 solubility.
The file may contain at most 300 lines, and a SMILES string may be at most 200 characters long. Enter the name of the file in the SMILES-file box.
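
The input format above can be illustrated with a small reader. This is only a sketch and not part of SOFA: it assumes the fields after the category marker are separated by whitespace, as in the example lines, and the names `Record`, `parse_line`, and `parse_records` are hypothetical.

```python
from typing import List, NamedTuple

# Category markers and limits as stated in the manual.
CATEGORY = {"+": "training", "-": "calibration", "#": "test"}
MAX_RECORDS = 300
MAX_SMILES_LEN = 200


class Record(NamedTuple):
    category: str
    identifier: str
    smiles: str
    endpoint: float


def parse_line(line: str) -> Record:
    """Parse one line such as '+109-66-0 CCCCC -6.1'."""
    line = line.strip()
    marker, rest = line[0], line[1:]
    if marker not in CATEGORY:
        raise ValueError(f"unknown category marker: {marker!r}")
    # Assumption: identifier, SMILES and endpoint are whitespace-separated.
    identifier, smiles, value = rest.split()
    if len(smiles) > MAX_SMILES_LEN:
        raise ValueError("SMILES string longer than 200 characters")
    return Record(CATEGORY[marker], identifier, smiles, float(value))


def parse_records(text: str) -> List[Record]:
    """Parse all non-empty lines and enforce the 300-line limit."""
    records = [parse_line(l) for l in text.splitlines() if l.strip()]
    if len(records) > MAX_RECORDS:
        raise ValueError("more than 300 lines in the input file")
    return records


sample = """+109-66-0 CCCCC -6.1
-111-65-9 CCCCCCCC -5.2
#110-82-7 C1CCCCC1 -5.3"""
records = parse_records(sample)
for record in records:
    print(record)
```

Note that the leading '-' of a calibration line is the category marker, not part of the CAS number, so the marker is removed before the remaining fields are split.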

To obtain results, carry out the following steps:
- Click the 'SMS->CW(SF)' button. Each SMILES string is represented as a set of SMILES fragments (SF) together with their correlation weights (CW); the starting value of every correlation weight is zero.
- Click the 'Optimiz CW(SF)' button. The Monte Carlo method is used to find the values of the correlation weights that give the largest possible value of

RR = R2(training) + R2(calibration) - abs[R2(training) - R2(calibration)] * alpha

where R(training) and R(calibration) are the correlation coefficients, for the training and calibration sets respectively, between the endpoint under consideration and the descriptor calculated as

DCW = Σ CW(SF)

that is, the sum of the correlation weights of all SMILES fragments in the string.
- Click 'Save CW(SF)' to save a file (OPT) that contains the optimal correlation weights of the SMILES fragments. The list of SMILES fragments is built hierarchically: first, four-character fragments such as [O+] and [N-] are detected; second, two-character fragments such as Cl and Br; finally, all remaining fragments are single characters.
- Click 'Demo DCW' to save a file (DCW) with a demonstration of the DCW calculation.
- Click 'Demo MDL' to save a file (MDL) with the experimental values of the endpoint and the values calculated as E = C0 + C1 * DCW.
Nit is the total number of iterations for the optimization. ID for DEMO is the number of the input line used to demonstrate the DCW descriptor calculation (the record written to the DCW file).
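
The calculation behind these buttons can be sketched as follows. This is an illustration under stated assumptions, not SOFA's actual code: R2 is read here as the squared correlation coefficient, a bracketed fragment is taken to be exactly four characters long, only fragments that occur in the training set receive weights, and the random-perturbation loop is a generic stand-in for the Monte Carlo scheme, which the manual does not describe. All function names and the values `alpha=0.5` and `n_iter=500` are hypothetical.

```python
import random
import re

# Hierarchy from the manual: four-character bracketed fragments first,
# then two-character Cl and Br, then single characters.
# Assumption: a bracketed fragment is "[" + exactly two characters + "]".
FRAGMENT = re.compile(r"\[[^\]]{2}\]|Cl|Br|.")


def split_fragments(smiles):
    """Split a SMILES string into SMILES fragments (SF)."""
    return FRAGMENT.findall(smiles)


def dcw(smiles, cw):
    """DCW = sum of CW(SF); fragments without a weight contribute zero."""
    return sum(cw.get(sf, 0.0) for sf in split_fragments(smiles))


def _moments(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return mx, my, sxx, syy, sxy


def r_squared(xs, ys):
    """Squared Pearson correlation; 0.0 if either series is constant."""
    _, _, sxx, syy, sxy = _moments(xs, ys)
    if sxx < 1e-12 or syy < 1e-12:
        return 0.0
    return sxy * sxy / (sxx * syy)


def fit_line(xs, ys):
    """Least-squares C0, C1 for E = C0 + C1 * DCW."""
    mx, my, sxx, _, sxy = _moments(xs, ys)
    if sxx < 1e-12:  # constant descriptor: fall back to the mean
        return my, 0.0
    c1 = sxy / sxx
    return my - c1 * mx, c1


def rr(cw, training, calibration, alpha):
    """RR = R2(trn) + R2(cal) - abs[R2(trn) - R2(cal)] * alpha."""
    def r2(dataset):
        return r_squared([dcw(s, cw) for s, _ in dataset],
                         [e for _, e in dataset])
    r2_t, r2_c = r2(training), r2(calibration)
    return r2_t + r2_c - abs(r2_t - r2_c) * alpha


def optimize(training, calibration, alpha=0.5, n_iter=500, seed=0):
    """Random search: perturb one weight at a time, keep improvements."""
    rng = random.Random(seed)
    # Assumption: only fragments seen in the training set get a weight.
    fragments = sorted({sf for s, _ in training for sf in split_fragments(s)})
    cw = {sf: 0.0 for sf in fragments}  # start values are zero
    best = rr(cw, training, calibration, alpha)
    for _ in range(n_iter):
        sf = rng.choice(fragments)
        old = cw[sf]
        cw[sf] = old + rng.uniform(-0.5, 0.5)
        trial = rr(cw, training, calibration, alpha)
        if trial > best:
            best = trial
        else:
            cw[sf] = old  # reject the move
    return cw, best


# Rows taken from the example input above.
training = [("CCCCC", -6.1), ("CCCCCC", -5.1), ("CC(C)CCCCC", -5.2),
            ("CCCCCCCCCCCC", -3.5), ("C1CCC2CCCCC2C1", -3.3),
            ("BrC1CCCC1", -4.2)]
calibration = [("CCCCCCCC", -5.2), ("CCCCCCCCCC", -4.7),
               ("ClC1CCCCC1", -4.1), ("BrC1CCCCC1", -3.4)]

cw, best = optimize(training, calibration)
c0, c1 = fit_line([dcw(s, cw) for s, _ in training],
                  [e for _, e in training])
print("RR =", best)
print("E(BrC1CCCCC1) =", c0 + c1 * dcw("BrC1CCCCC1", cw))
```

With only a handful of compounds the fitted weights are not meaningful; the point is the order of operations: fragment the SMILES, sum the weights into DCW, score the weights by RR, then fit the linear model on the training set.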

WARNING:
The split into training and test sets (or into training, calibration, and test sets) should satisfy the following conditions:
- The range of numerical values of the endpoint should be similar for the training, calibration (if any), and test sets;
- The number of substances in each set should be reasonably large: at least 10 for the training set, at least 5 for the calibration set, and at least 3 for the test set.
The more substances, the better. A reasonable split into training, calibration, and test sets is 50:30:20 percent.
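
A split can be checked against these conditions before running the optimization. The helper below is a hypothetical sketch, not a SOFA feature: it reports the number of substances and the endpoint range for each set and flags sets below the stated minimum, leaving the judgement of whether the ranges are "similar" to the user.

```python
# Minimum set sizes as stated in the warning above.
MIN_COUNT = {"training": 10, "calibration": 5, "test": 3}


def check_split(records):
    """Summarize a split given (category, endpoint) pairs.

    An absent calibration set simply shows up with count 0.
    """
    values = {name: [] for name in MIN_COUNT}
    for category, endpoint in records:
        values[category].append(endpoint)
    report = {}
    for name, vals in values.items():
        report[name] = {
            "count": len(vals),
            "range": (min(vals), max(vals)) if vals else None,
            "enough": len(vals) >= MIN_COUNT[name],
        }
    return report


# Synthetic endpoint values, for illustration only.
records = (
    [("training", float(i)) for i in range(10)]
    + [("calibration", float(i)) for i in range(1, 6)]
    + [("test", 2.0), ("test", 7.0)]
)
report = check_split(records)
for name, summary in report.items():
    print(name, summary)
```

In this synthetic example the test set has only two substances, so it is flagged as too small.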
