Simple ab initio materials data mining: tutorial

Published in Materials Informatics Lab, Feb 7, 2016

The Tilde Python framework is in early beta and under steady development. Today we are going to look at what can already be done with it. First, we build our first ab initio materials database from the sample modeling data (feel free to use your own real data). Then, we construct a couple of SQL queries to extract some interesting bits from our new database.

Objectives

Our first task is to find, for each distinct crystalline cell, the electronic band gap obtained in the calculation with the minimal total electronic energy. Sounds reasonable, right?

Our second task extends the first one to binary compounds and also takes the periodic groups of the elements into account: we look for clusters in the values of the periodic groups and band gaps. That is, we find all the minimum-energy band gaps and the corresponding pairs of chemical elements forming a binary compound. Then we apply the popular clustering algorithm called k-means to the series of values “group of the first element, group of the second element, band gap”. Hopefully, given reasonable data, we reproduce reasonable physics: e.g. we discover one cluster of strongly ionic-bonded insulators, another cluster of more covalently bonded insulators, a third cluster of semiconductors, etc. Obviously, the outcome depends on the data we have and on the number of clusters into which we divide them.

It should be noted, though, that the details of the ab initio calculations are not taken into account in the present tutorial. Obviously, selecting the band gap for each distinct crystalline cell only by the total electronic energy is very naive. One has to distinguish (at least) the electronic basis sets, potentials (i.e. DFT exchange-correlation functionals), reciprocal space integration samplings, and, finally, versions of the modeling packages. Only grouping the crystalline objects by those criteria and then selecting a band gap from each group by the minimum energy is physically correct. However, we omit those additional selection criteria for the purposes of this tutorial.
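To illustrate the point about grouping, here is a minimal sketch in plain Python. It is not Tilde code: the function name select_min_energy, the record fields, and the numbers are all hypothetical and made up for illustration. It only shows how extending the grouping key (here with the exchange-correlation functional) changes which calculations get selected.

```python
def select_min_energy(records, key_fields):
    """Return the lowest-energy record for every distinct grouping key."""
    best = {}
    for rec in records:
        key = tuple(rec[field] for field in key_fields)
        if key not in best or rec["energy"] < best[key]["energy"]:
            best[key] = rec
    return best


# Made-up records (NOT real calculation results), for illustration only
records = [
    {"formula": "MgO", "functional": "PBE", "energy": -10.0, "gap": 4.5},
    {"formula": "MgO", "functional": "PBE", "energy": -10.2, "gap": 4.7},
    {"formula": "MgO", "functional": "B3LYP", "energy": -9.8, "gap": 7.0},
]

# Naive grouping, as in this tutorial: one record per formula
naive = select_min_energy(records, ("formula",))
# More careful grouping: one record per formula and functional
careful = select_min_energy(records, ("formula", "functional"))
print(len(naive), len(careful))
```

With the naive key, all three records compete with each other; with the extended key, the records computed with different functionals are kept apart.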

Let’s go then. You will need some background in chemistry, solid state physics, first-principles modeling, and informatics.

Materials Simulations plus Python

Tilde creates systematized repositories from the simulation logs of the VASP, CRYSTAL and Quantum ESPRESSO ab initio electronic-structure modeling packages. The folders with the log files (e.g. *.cryst.out for CRYSTAL, vasprun.xml for VASP and *.out for QE) are scanned and the results are added to the database. The database is augmented with a user interface and data analysis tooling, and voilà, we have a repository.

CRYSTAL, VASP and Quantum ESPRESSO: supported formats of the log files

Given the rich variety of modeling methods employed and the absence of any format agreements, the log format of each materials science package is more or less unique. However, from the scientific point of view, there are always common portions of information, called metadata, that characterize a particular ab initio calculation. Metadata includes the crystalline object (e.g. the primitive cell of the silicon crystal under pressure), the calculation setup (i.e. the stack of approximations used together with the numerical precision, and the name and version of the simulation package), and, most importantly, the calculated physical quantities (e.g. the total electronic energy and band gap). In order to perform a calculation, one has to provide the starting conditions as an input file. Then, if the calculation succeeds, the results are written by the materials science package into an output file (or files).

For the same calculation, outputs in different formats may coexist, fully or partially repeating the information. Thus, when several output formats are available, a choice must be made as to which output contains the most information (this one is called the master output). Obviously, this is done to organize and minimize the parsing effort. The master output is then the starting point of the parsing. If the information of interest cannot be extracted from the master output, the other output files are checked, and if they do not contain what we require either, the calculation is considered incomplete and rejected.
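The fallback logic described above can be sketched as follows. This is only an illustration under simplifying assumptions, not Tilde's parser: each output file is represented by a plain dictionary of already-parsed quantities, and the function name extract_quantities is hypothetical.

```python
def extract_quantities(required, master_output, other_outputs=()):
    """Collect the required quantities, trying the master output first.

    Returns a dict with all required quantities, or None if any of them
    is missing everywhere (the calculation is then treated as incomplete).
    """
    found = {}
    for name in required:
        for output in (master_output, *other_outputs):
            if name in output:
                found[name] = output[name]
                break
        else:
            # for/else: no output file contained this quantity
            return None
    return found
```

The master output is always consulted first, so the secondary files are only touched for quantities it lacks.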

At the moment only Unix-like systems (Debian, Ubuntu, SLES, etc.) are supported by the Tilde framework. This is not a problem, however, as first-principles calculations are normally done in Unix-like environments.

To start, please download the framework archive from GitHub and unpack it. Note that the following system packages:

build-essential python-dev libffi-dev

(named -dev or -devel, depending on the distribution) must be present on your system (normally, they are). The first provides a C compiler; the second and third are required for building Python extensions. Also, the Python numeric library:

python-numpy

should be present. To check, type in the console:

which python
python -c "from distutils.sysconfig import get_makefile_filename"
python -c "import numpy"

You should see the path of your Python executable and no errors afterwards, indicating “Alles in Ordnung” with your Python. If not, please install the missing libraries or ask your system administrator for help.

The preferred way to install Tilde is to create a virtual environment, so that your main Python installation is not affected. This way you also do not need root access for the installation. Set up a virtual environment inside the Tilde folder (note the --system-site-packages option, which shares access to python-numpy):

virtualenv --system-site-packages path/to/tilde_folder

Then activate the virtual environment inside the Tilde folder and check the Python executable path again to make sure it has changed:

cd path/to/tilde_folder
. bin/activate
which python

You now have a new, separate Python, so you can install or remove anything without fear of messing up system-wide things. Generally, a virtual environment should always be used while working with the codebase. To install Tilde, run the following:

pip install -r requirements.txt

Again, if you see errors, make sure that the C compiler and libffi-dev (or -devel), mentioned earlier, are installed. After all the required dependencies are downloaded and compiled, Tilde should be ready to use. Please check it like this:

./utils/tilde.sh -x

In general, the tilde.sh script is the central dispatcher (entry point) for managing the framework. It supports a number of options; have a look with:

./utils/tilde.sh --help

Now you may scan your folder with calculations or, alternatively, the examples shipped with Tilde (tilde_folder/tests/data, although it holds only a few files). For now, please do not scan huge folders with terabytes of data using this version of Tilde, because this would take too much time. Of course, the folders with the data are sacred and are never written to by Tilde, so you may re-scan them as often as you like. Normally, a folder with 100–200 typical ab initio calculations takes 10–15 minutes to scan. Let’s start then:

./utils/tilde.sh /home/science/one /home/science/two -r -t -v -i -a

Several command-line parameters are used here: scan the folders recursively (-r), print tersely (-t), show detailed information for the calculations found (-i), including convergence (-v), and, finally, add the results to a database (-a).

Now the information is extracted from the calculation logs and stored in a systematic way in the SQLite database tilde_folder/data/default.db. We are not going to bother with a PostgreSQL database right now, although it is very easy to enable (just edit the Tilde settings file tilde_folder/data/settings.json).

You may check now how many calculations you have scanned:

./utils/tilde.sh -x

SQL comes into play

Now let us take advantage of our scanned data. Generally, we would like to request different portions of information from several tables in our database and join these portions together in the way we prefer. This is best seen with an example. To explore the SQLite database in tilde_folder/data/default.db you need the Firefox web browser with the SQLite Manager add-on installed (it takes less than a minute to install). Just start Firefox, start the add-on, choose the default.db file and have a look at the tables inside. Altogether there are 28 tables making up the database schema. You may see that e.g. the total electronic energy is stored in the energies table, the calculation metadata in the metadata table, information about the crystalline cell (i.e. chemical formula) in the struct_ratios table, band gap values in the electrons table, etc. In these tables each row stands for a particular calculation. We are now primarily interested in the information stored in these four tables.
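If you prefer to stay in Python, a similar overview can be obtained with the standard sqlite3 module. The sketch below makes no assumption about Tilde's schema; it simply asks SQLite for the list of tables and counts the rows in each. The function name table_row_counts is our own.

```python
import sqlite3


def table_row_counts(conn):
    """Map every user table of an open SQLite connection to its row count."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
        if not row[0].startswith("sqlite_")  # skip SQLite's internal tables
    ]
    # Table names cannot be bound as query parameters, so they are quoted
    return {
        name: conn.execute('SELECT COUNT(*) FROM "{}"'.format(name)).fetchone()[0]
        for name in names
    }
```

For example, from inside tilde_folder one could open the database with conn = sqlite3.connect("data/default.db") and then print table_row_counts(conn).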

Please navigate to the folder tilde_folder/tutorials/simple_data_mining. There are two miner_*.py scripts there, which we are going to use for our simple data mining. In the script miner_bandgaps.py we collect all the relevant pieces of information from these four tables and join them according to the following conditions:

each calculation has the chemical formula, crystalline cell, band gap, and total electronic energy,
calculations must be grouped by the chemical formula and the number of formula units per crystalline cell,
from each group one calculation with the minimum total electronic energy must be selected,
if this calculation does not yield a conducting state, its band gap, chemical formula and (additionally) the full path of the master output must be shown.

All this is done entirely in SQL at the level of the database. But we are currently dealing with Python, so what if we do not want to bother with SQL right now? This is where the ORM (object-relational mapping) employed in Tilde helps us. Look, there is no SQL code in miner_bandgaps.py. Instead, we have a loop over the ORM model objects (which represent the database tables). Although the ORM may seem an over-complication (and it really is), let us agree for now that its advantages outweigh its disadvantages. Thus we mimic the database queries with special Python objects, which makes a good demo for the purposes of this tutorial.
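For comparison, the four conditions above can also be written as one plain SQL query. The sketch below runs it with Python's built-in sqlite3 module against a deliberately simplified, hypothetical single-table schema (a table calcs with the columns formula, units, energy, gap and path). Tilde's real schema spreads this information over several tables, and the rows in the usage example are made-up numbers, so treat this as an illustration of the query shape only.

```python
import sqlite3

QUERY = """
    SELECT c.formula, c.units, c.gap, c.path
    FROM calcs AS c
    JOIN (
        -- lowest energy per (formula, formula units) group
        SELECT formula, units, MIN(energy) AS e_min
        FROM calcs
        GROUP BY formula, units
    ) AS m
      ON c.formula = m.formula AND c.units = m.units AND c.energy = m.e_min
    WHERE c.gap > 0          -- drop conducting states
    ORDER BY c.formula, c.units
"""


def min_energy_gaps(rows):
    """Load rows into a throwaway in-memory table and run the selection."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE calcs "
        "(formula TEXT, units INTEGER, energy REAL, gap REAL, path TEXT)"
    )
    conn.executemany("INSERT INTO calcs VALUES (?, ?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


# Made-up rows (NOT real calculation results), for illustration only
rows = [
    ("MgO", 1, -10.0, 7.2, "/calc/a"),
    ("MgO", 1, -10.5, 7.5, "/calc/b"),   # lowest energy in its group
    ("Si", 2, -20.0, 0.0, "/calc/c"),    # lowest energy, but conducting
    ("Si", 2, -19.0, 1.1, "/calc/d"),
]
print(min_energy_gaps(rows))
```

Note that the conducting-state filter is applied after the minimum-energy selection, so a group whose lowest-energy calculation is conducting yields no row at all.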

Running the script should print the band gap, the chemical formula and the master output path for each selected calculation.

Clustering with the k-means algorithm

Given a set of points (in our case, in a three-dimensional space), k-means aims to partition the points into a given number of clusters so as to minimize the within-cluster sum of squared distances from each point to its cluster center. In our case (non-conducting binary compounds), each point is given by three values: x is the group of the first element, y is the group of the second element, and z is the band gap.
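To make the idea concrete, here is a minimal sketch of the classic iterative procedure (often called Lloyd's algorithm) in plain Python. It is not the kmeans.py shipped with Tilde: the function name kmeans_sketch and the naive initialization from the first k points are our own simplifying assumptions.

```python
def kmeans_sketch(points, k, max_iterations=100):
    """Partition points (tuples of numbers) into k clusters.

    Returns (centers, clusters), where clusters[i] lists the points
    assigned to centers[i].
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centers = [tuple(p) for p in points[:k]]  # naive start: the first k points
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move every center to the mean of its members
        new_centers = [
            tuple(sum(c) / len(members) for c in zip(*members))
            if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: nothing moved
            break
        centers = new_centers
    return centers, clusters
```

The result depends on the initialization, which is why practical implementations usually start from randomly chosen points and may repeat the run several times.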

The script miner_kmeans.py partly repeats the functionality of miner_bandgaps.py: we see mostly the same loop over the model objects there. The difference is that we ask our database only for the binary compounds. Using the ORM, this looks like:

.filter(model.Struct_ratios.nelem == 2)

But then, instead of simply printing, we collect the results (i.e. the element groups and the respective band gaps) in the Python list data and execute the clustering:

clusters = kmeans(data, k_from_n(len(data)))
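How a single entry of such a data list might be assembled is sketched below. This is an assumption about the data layout, not a copy of miner_kmeans.py: the GROUPS table covers only a few hand-picked elements, the function name to_point is hypothetical, and the order of the two elements is simply taken as given.

```python
# Periodic group numbers for a small hand-picked subset of elements
GROUPS = {
    "Na": 1, "Mg": 2, "Zn": 12, "Ga": 13,
    "As": 15, "O": 16, "S": 16, "Cl": 17,
}


def to_point(elements, band_gap):
    """Turn a binary compound into an (x, y, z) point for clustering.

    elements -- pair of element symbols, e.g. ("Ga", "As")
    band_gap -- band gap of the selected calculation
    """
    first, second = elements
    return (GROUPS[first], GROUPS[second], band_gap)
```

A list of such tuples is the kind of input a k-means routine expects.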

The k-means algorithm does not determine the number of clusters into which to divide our data; this number must be supplied. We naively try to guess it from the size of our data: see the k_from_n function in the file kmeans.py. In principle, it is up to the reader to decide whether this guess makes sense. If the scanned calculations were done for similar crystals, the chosen number of clusters is likely to be wrong; in such a case all the calculations should probably be treated as a single cluster. For heterogeneous data (i.e. many different crystals), on the other hand, the automatically chosen number of clusters should make sense.
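We do not reproduce k_from_n here. As a hedged illustration of what such a naive guess can look like, one widely quoted rule of thumb takes k to be roughly the square root of n/2. Whether Tilde uses this or another formula is an assumption that should be checked against kmeans.py; guess_k is a hypothetical name.

```python
import math


def guess_k(n):
    """Rule-of-thumb cluster count for n data points: about sqrt(n / 2).

    Always returns at least 1, so that tiny data sets form a single cluster.
    """
    return max(1, int(round(math.sqrt(n / 2.0))))
```

Such a guess grows slowly with the data size and knows nothing about the data themselves, which is exactly why it can fail for very homogeneous sets.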

As a result we should obtain two files in the visualization sub-folder: points.csv and clusters.csv. The former contains all the values before clustering, and the latter contains these values grouped into clusters. Together with these files, there is an HTML5 webpage for visualization. If we open it in the browser, we should see something similar to http://tilde-lab.github.io/simple-k-means-visualization. Note, though, that simply double-clicking this webpage does not work, as you need a web server to load all the web components properly. Python provides such a web server if you use the following command inside the folder tilde_folder/tutorials/simple_data_mining/visualization (this is the Python 2 form; under Python 3 the module is called http.server):

python -m SimpleHTTPServer

and then point your web browser to http://localhost:8000. The online example above was produced from a set of old calculations and contains 4 clusters (see below). Looks reasonable, doesn’t it?

Example k-means clustering for a set of simulated binary compounds — does it look similar to your data?

Summary

In this tutorial we have built a database of ab initio materials science calculations and extracted some meaningful information as an example. Although not really new, this information is very time-consuming to extract in traditional ways (grep, manual bookkeeping, etc.), and the database approach provides a robust and convenient way of recording scientific information. We have also become acquainted with a popular clustering algorithm, able to provide a fresh look at our data.

Tilde framework code repository: https://github.com/tilde-lab/tilde
A separate k-means visualization code repository: https://github.com/tilde-lab/simple-k-means-visualization
The same as above, as the single ready-to-view webpage: http://tilde-lab.github.io/simple-k-means-visualization
Some interesting open-access ab initio simulation data: https://github.com/zhongnanxu/rutile-OER

Let us hack the materials?