Machine Learning Toolkit

We have developed MaDE@UB, a web-based ML toolkit that enables researchers to apply ML techniques to a wide variety of materials science data and extract unique insights. MaDE@UB extends the capabilities of ChemML, a machine learning and informatics program suite for the chemical and materials sciences, with two major additions: a Python-based feature extraction library for inorganic materials, which opens the toolkit to a broader range of materials scientists, and an intuitive GUI that makes it convenient to use and easy to learn.

Our ML toolkit has been released and is available here. Read more about the design of our toolkit below.

Fig. 1. Overview of the MaDE@UB ML toolkit. Blocks with yellow background denote a general purpose machine learning pipeline. Blocks with blue background denote various aspects of ChemML[1] and blocks with gray background denote the additional capabilities that MaDE@UB will have on top of ChemML.

In the following, we provide a brief overview of ChemML, the Python-based feature extraction library for inorganic materials, and the GUI developed as part of MaDE@UB. In addition, we briefly describe a few of the feature extraction methods and features available in MaDE@UB, along with use cases from the literature in which they have been employed to predict various properties.

ChemML

ChemML[1] is a machine learning (ML) and informatics program package for the validation, analysis, mining, and modeling of large-scale chemical data sets. The overall design of ChemML (shown in Fig. 1) consists of two main frameworks, ChemML Wrapper and ChemML Library. ChemML Wrapper is a flexible and versatile framework that compiles a diverse collection of techniques from different sources. ChemML Library is a Python library that houses existing methods for which implementations are not (publicly) available, as well as new methods that are being developed in our group.

The current version of ChemML includes implementations of feature representation methods (descriptors) such as the Coulomb matrix[2] and bag-of-bonds[3], interfaces for RDKit fingerprints and the Dragon molecular descriptor software, and a Python translation of the inorganic descriptors available in the Magpie[4] library. We also provide the majority of data manipulation, preprocessing, modeling, and even evolutionary search algorithms. In total, ChemML Wrapper offers 94 functions and classes from its own library or from external packages including Pandas[5], Scikit-learn[6], Tensorflow[7], Keras[8], RDKit, OpenBabel[5], Dragon, Deap[9], and Matplotlib[10]. All these functions are categorized by their host name and by seven major tasks: (1) enter, (2) represent, (3) prepare, (4) model, (5) search, (6) visualize, and (7) store.

ChemML Wrapper allows us to build and run an arbitrary graph of operations via configuration files. Since computation graphs can become very complex, we have also developed a graphical user interface using Jupyter notebooks and Ipywidgets to create, modify, and verify configuration files. Fig. 3 shows a graph visualization of an ML work-flow designed with the wrapper GUI. Such graphs are subsequently saved as a configuration file that can be run on any system on which ChemML has been installed. More customizable templates are provided in the wrapper GUI to serve different data mining purposes.
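To make one of the descriptors named above concrete, the following is a minimal sketch of the Coulomb matrix as defined by Rupp et al.[2]: diagonal entries are 0.5 Z_i^2.4 and off-diagonal entries are Z_i Z_j / |R_i - R_j|. The function name and the H2 geometry are illustrative assumptions; this is not the ChemML implementation or its API.

```python
import math


def coulomb_matrix(atomic_numbers, positions):
    """Return the Coulomb matrix as a list of rows.

    atomic_numbers: nuclear charges Z_i
    positions: Cartesian coordinates (x, y, z), assumed to be in bohr
    """
    n = len(atomic_numbers)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal term: 0.5 * Z_i^2.4
                matrix[i][j] = 0.5 * atomic_numbers[i] ** 2.4
            else:
                # Off-diagonal term: nuclear repulsion Z_i * Z_j / distance
                distance = math.dist(positions[i], positions[j])
                matrix[i][j] = atomic_numbers[i] * atomic_numbers[j] / distance
    return matrix


# Hypothetical example: an H2 molecule with a bond length of 1.4 bohr.
h2 = coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (0.0, 0.0, 1.4)])
print(h2)
```

The matrix is symmetric by construction, and its size grows with the number of atoms, so molecules of different sizes generally need padding or sorting before they can be compared.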

 

Fig. 2. Design scheme of ChemML.

Fig. 3. A sample wrapper GUI workflow that includes computation blocks to load data, generate bag-of-bonds descriptors and plot properties.

The intuitive and easy-to-use GUI of MaDE@UB makes the toolkit a versatile resource for research and education, and approachable for materials scientists with little to no ML expertise. With a low entry barrier by construction, it encourages more researchers to design new and efficient materials more quickly, thus shortening the materials development life cycle.

Augmentations to ChemML:

Feature extraction library for inorganic materials

Based on a literature survey, we identified that the Java-based Materials-Agnostic Platform for Informatics and Exploration (Magpie[4]) offers several useful feature extraction methods for inorganic materials. To integrate them with MaDE@UB, and because Python provides seamless access to a plethora of existing machine learning libraries, we converted a total of 23 feature extraction classes of Magpie into Python. CompositionEntry and CrystalStructureEntry were identified as the core data structures on which the feature extraction classes depend, and hence they were converted first. Of the 23 classes, 12 are composition-based and 11 are crystal structure-based feature extraction methods from the literature. In addition, several helper classes from the Java-based Versatile Atomic Scale Structure Analysis Library (Vassal, https://bitbucket.org/wolverton/vassal) were also converted to Python.

Composition-based features:

Input for the composition-based features is in the form of the absolute path to plain-text files containing the chemical formula, which can be written in multiple formats. Two example formats (file contents) are shown below for three compounds, one compound per row:

NaCl Na,1.0,Cl,1.0
BaSO4 Ba,1.0,S,1.0,O,4.0
Fe2O3 Fe,2.0,O,3.0
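The two formats shown above can be read into the same element-to-amount mapping. The following is a minimal sketch under stated assumptions: the function names are hypothetical (they are not the CompositionEntry API), and the formula parser handles only flat formulas such as BaSO4, without parentheses or charges.

```python
import re

# One element symbol followed by an optional integer or decimal amount.
_ELEMENT_TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def parse_entry(entry):
    """Parse 'BaSO4' or 'Ba,1.0,S,1.0,O,4.0' into {'Ba': 1.0, 'S': 1.0, 'O': 4.0}."""
    entry = entry.strip()
    if "," in entry:
        # Comma-separated format: element,amount,element,amount,...
        tokens = [t.strip() for t in entry.split(",")]
        if len(tokens) % 2 != 0:
            raise ValueError(f"expected element,amount pairs: {entry!r}")
        pairs = list(zip(tokens[0::2], tokens[1::2]))
    else:
        # Plain formula format: a missing amount means 1.
        pairs = _ELEMENT_TOKEN.findall(entry)
        if not pairs or "".join(el + num for el, num in pairs) != entry:
            raise ValueError(f"unsupported formula: {entry!r}")
    amounts = {}
    for element, amount in pairs:
        amounts[element] = amounts.get(element, 0.0) + float(amount or 1)
    return amounts


def element_fractions(amounts):
    """Normalize element amounts so that they sum to one."""
    total = sum(amounts.values())
    return {element: amount / total for element, amount in amounts.items()}


print(parse_entry("BaSO4"))
print(element_fractions(parse_entry("Fe,2.0,O,3.0")))
```

Normalizing to fractions at this stage is a design choice that lets every downstream feature work on the same representation regardless of the input format.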

Based on the composition information and the corresponding property lookup-tables, different features are extracted by different classes. The simplest features based on the composition are the element fractions and their various norms. More complicated features involve weighting various properties, such as the atomic mass, the number of valence electrons in the s, p, d and f shells, and the atomic and covalent radii, according to the atomic fractions present in the composition, and computing statistics such as the mean and the range (max - min).
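As a worked illustration of the features just described, the sketch below computes a fraction-weighted mean, a range, and a norm of the element fractions for Fe2O3. The two-entry atomic mass table and all function names are assumptions made for this example; they are not the lookup-tables or classes shipped with MaDE@UB.

```python
# Illustrative lookup table (standard atomic masses in u); an assumption for this sketch.
ATOMIC_MASS = {"Fe": 55.845, "O": 15.999}


def weighted_mean(fractions, table):
    """Mean of a property, weighted by the atomic fraction of each element."""
    return sum(frac * table[element] for element, frac in fractions.items())


def value_range(fractions, table):
    """Range (max - min) of a property over the elements present."""
    values = [table[element] for element in fractions]
    return max(values) - min(values)


def fraction_norm(fractions, p):
    """L_p norm of the element-fraction vector."""
    return sum(frac ** p for frac in fractions.values()) ** (1.0 / p)


# Element fractions of Fe2O3: 2/5 Fe and 3/5 O.
fe2o3 = {"Fe": 0.4, "O": 0.6}

features = {
    "mean_atomic_mass": weighted_mean(fe2o3, ATOMIC_MASS),
    "range_atomic_mass": value_range(fe2o3, ATOMIC_MASS),
    "l2_norm_fractions": fraction_norm(fe2o3, 2),
}
print(features)
```

For Fe2O3 the weighted mean is 0.4 x 55.845 + 0.6 x 15.999, or about 31.94 u, and the range is 55.845 - 15.999 = 39.846 u.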

Atomic-fraction-weighted statistics of 9 properties (including atomic mass, electronegativity, and atomic radius) as well as element fractions were used as features to predict the thermodynamic stability[11] of various ternary compounds. Atomic-fraction-weighted statistics of 23 properties (including melting point, band gap energy of the 0 K ground state, and covalent radius), various norms of element fractions, the number of valence electrons, and features based on ionicity[12] were used to predict novel solar cell materials and suggest new glass-forming alloys[4]. Miracle radii[13,14] of elements were used to develop a predictive structural model for bulk metallic glasses[15]. Atomic-fraction-weighted statistics of oxidation states, electronegativity, and electron affinity of atoms were used to predict the total energies and formation enthalpies of metal-nonmetal compounds[24]. Features developed by Yang et al.[16] involving the mixing enthalpy of liquids[17], Miracle radii[13,14], and the melting temperature were used to predict whether a metal alloy will form a solid solution or a bulk metallic glass.

Crystal structure-based features:

Input for the crystal structure-based features is in the form of the absolute path to a directory containing crystal structures in the widely used VASP (POSCAR) format. An example (file contents) is shown below:

C
1.0
-2.056022 -2.056022 -2.056022
-2.056022 2.056022 2.056022
2.056022 2.056022 -2.056022
C
4
direct
0.7500017023 0.0000000000 0.0000000000
0.0000000000 0.5000019455 0.2499997568
0.7500017023 0.7500017023 0.2500021887
0.5000019455 0.7500017023 0.5000019455
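The structure file above can be read with a few lines of code. The following is a minimal sketch, not the CrystalStructureEntry implementation: it assumes a file with a species line, a positive scale factor, no selective-dynamics line, and direct (fractional) coordinates, and all function names are hypothetical.

```python
POSCAR_TEXT = """\
C
1.0
-2.056022 -2.056022 -2.056022
-2.056022 2.056022 2.056022
2.056022 2.056022 -2.056022
C
4
direct
0.7500017023 0.0000000000 0.0000000000
0.0000000000 0.5000019455 0.2499997568
0.7500017023 0.7500017023 0.2500021887
0.5000019455 0.7500017023 0.5000019455
"""


def parse_poscar(text):
    """Parse the subset of the VASP structure format used in the example."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    scale = float(lines[1])
    if scale <= 0:
        raise ValueError("this sketch only handles a positive scale factor")
    # Three lattice vectors, each multiplied by the scale factor.
    lattice = [[scale * float(x) for x in lines[i].split()] for i in (2, 3, 4)]
    symbols = lines[5].split()
    counts = [int(c) for c in lines[6].split()]
    if not lines[7].lower().startswith("d"):
        raise ValueError("this sketch only handles direct (fractional) coordinates")
    n_atoms = sum(counts)
    frac = [[float(x) for x in ln.split()[:3]] for ln in lines[8:8 + n_atoms]]
    # Expand species symbols so that there is one entry per atom.
    species = [sym for sym, n in zip(symbols, counts) for _ in range(n)]
    return {"comment": lines[0], "lattice": lattice,
            "species": species, "frac_coords": frac}


def frac_to_cart(lattice, frac):
    """Convert fractional coordinates to Cartesian: r = f1*a1 + f2*a2 + f3*a3."""
    return [sum(frac[i] * lattice[i][k] for i in range(3)) for k in range(3)]


def cell_volume(lattice):
    """Cell volume as the absolute value of the triple product a1 . (a2 x a3)."""
    a1, a2, a3 = lattice
    cross = [a2[1] * a3[2] - a2[2] * a3[1],
             a2[2] * a3[0] - a2[0] * a3[2],
             a2[0] * a3[1] - a2[1] * a3[0]]
    return abs(sum(a1[k] * cross[k] for k in range(3)))


cell = parse_poscar(POSCAR_TEXT)
cart_coords = [frac_to_cart(cell["lattice"], f) for f in cell["frac_coords"]]
print(cell["species"], cell_volume(cell["lattice"]))
```

Keeping the lattice and the fractional coordinates separate, rather than storing only Cartesian positions, makes it straightforward to generate periodic images later.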

All crystal structures are represented as Cells or Voronoi Cells and a Voronoi tessellation is performed prior to generating most of the features. Crystal structure-based features involve structure dependent factors such as the radial distribution function, coordination numbers, nearest neighbor distances etc.

The radial distribution function (RDF) describes the number of atoms found at a distance between r and r + dr from a given atom's position. Atomic property-weighted radial distribution functions[18] were used to predict the gas uptake capacity of metal-organic frameworks. Metal-organic frameworks are useful in many applications, including water splitting, solar cells, CO2 reduction, Li-ion batteries, supercapacitors, and fuel cells[19]. The partial radial distribution function developed by Schütt et al.[20] was used to predict the density of states at the Fermi energy.
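The definition above can be sketched as a histogram of neighbor counts per distance shell. This is an unnormalized illustration under stated assumptions, not the MaDE@UB implementation: the function names are hypothetical, and the caller must choose n_images large enough that all periodic images within the cutoff are included.

```python
import itertools
import math


def frac_to_cart(lattice, frac):
    """Convert fractional coordinates to Cartesian coordinates."""
    return [sum(frac[i] * lattice[i][k] for i in range(3)) for k in range(3)]


def neighbor_shell_counts(lattice, frac_coords, cutoff, dr, n_images=1):
    """Average number of atoms per atom in each shell [r, r + dr) below the cutoff."""
    n_bins = int(round(cutoff / dr))
    counts = [0.0] * n_bins
    cart = [frac_to_cart(lattice, f) for f in frac_coords]
    # Translation vectors of the periodic images that are considered.
    image_range = range(-n_images, n_images + 1)
    shifts = [frac_to_cart(lattice, s)
              for s in itertools.product(image_range, repeat=3)]
    for ri in cart:
        for rj in cart:
            for shift in shifts:
                image = [rj[k] + shift[k] for k in range(3)]
                d = math.dist(ri, image)
                if d < 1e-8 or d >= cutoff:
                    continue  # skip the atom itself and anything past the cutoff
                counts[min(int(d / dr), n_bins - 1)] += 1
    return [c / len(cart) for c in counts]


# Hypothetical test structure: simple cubic lattice, unit edge length, one atom.
simple_cubic = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shells = neighbor_shell_counts(simple_cubic, [(0.0, 0.0, 0.0)], cutoff=1.5, dr=0.25)
print(shells)
```

For the simple cubic test case the histogram shows 6 neighbors in the shell starting at r = 1.0 and 12 in the shell starting at r = 1.25 (which contains the distance sqrt(2)). A full RDF would additionally divide each bin by the shell volume and the number density.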

The coordination number of an atom denotes the number of its nearest neighbors. Features that depend on the coordination numbers of the elements present in the compound include various statistics of these numbers, the average Warren-Cowley[21] ordering parameter for the bond network, and the face-size-weighted coordination number. Closely related to the coordination number features is the maximum packing efficiency feature, which is computed by finding the largest sphere that fits inside each Voronoi cell and comparing the volume of that sphere to the volume of the cell. Features such as the eigenvalues of an approximation to the Coulomb matrix[22] that considers periodicity were used to predict formation energies of solids. Features based on the heterogeneity of the structure include various statistics of the bond length and the cell volume. Features based on the coordination polyhedron include the similarity of the crystal structure to simple lattices such as cubic, body-centered-cubic, and face-centered-cubic. Features based on local effects between neighboring atoms include various statistics on a set of elemental properties. Ward et al.[23] combined composition-based features with most of the crystal structure-based features to predict formation energies of more than 400,000 inorganic compounds.

Innovative Graphical Work-flow driven User Interface

Fig. Overview of the GUI Platform. The GUI is used to build a Graph that describes the work-flow. The user can directly interact with the graph with context menus and pop-up dialog boxes.

An easy-to-use GUI has been developed to reduce the time it takes to build prototypes for ML models as well as experiment with various feature extraction methods available as part of MaDE@UB. Further, plug-and-play-style interactions for the different blocks associated with a typical ML pipeline as well as the graphical representation with directed edges, make the GUI intuitive to use and easy to learn. Additionally, because of the modular nature of the blocks involved, the GUI is highly customizable and users can build as complex a pipeline as needed. It also includes various built-in templates (short ML pipeline examples) as well as tutorials to help the user better understand the various aspects of MaDE@UB.

The GUI Platform allows users to create an ML pipeline by defining the work-flow as a Graph. The user interacts with the graph through context-menu options available by right-clicking on the graph itself, and can create and modify the different nodes in the work-flow. The user can also define the data flow between nodes by simply dragging and releasing the edge-handles present on a node and then setting the parameters in the pop-up dialog box.
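One way to picture a work-flow defined as a graph is as a mapping from each node to the nodes it receives data from, from which a valid execution order can be derived. The sketch below uses the Python standard library (graphlib, Python 3.9 or later); the node names and the execution_order helper are hypothetical and do not reflect how MaDE@UB stores or runs work-flows.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical work-flow: each node maps to the set of nodes it depends on.
workflow = {
    "load_data": set(),
    "bag_of_bonds": {"load_data"},
    "train_model": {"bag_of_bonds"},
    "plot_results": {"train_model", "load_data"},
}


def execution_order(graph):
    """Return the nodes in an order where every node follows its inputs."""
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as err:
        # A work-flow with a cycle has no valid execution order.
        raise ValueError(f"work-flow contains a cycle: {err.args[1]}") from err


order = execution_order(workflow)
print(order)
```

Checking for cycles before submission is a cheap way to reject a work-flow that could never finish running.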

Fig. Overview of the GUI Application Architecture. User creates and submits the model to the system, which is processed and stored and later used to visualize the results to the user.

Once the user creates the work-flow, they can submit it to the system, where it is scheduled by the scheduler based on the resources available. The results of the process are then stored in a database, to be retrieved later by the GUI and shown to the user.

The intuitive nature of the GUI helps a beginner set up a relatively complex model and gain valuable insights into the problem without needing to learn in detail the different aspects of implementing an ML pipeline.

Use cases of ML in Materials Sciences:


1. Journal of Chemical Information and Modeling: In this paper we have developed a new way of representing data embedded in materials databases. Using the formalism of Topological Data Analysis, we show how one can automatically explore potential correlations and connectivity between compounds in a chemical library. The information is presented as a "barcode" (or feature vector) that dynamically maps the evolution of similarity between materials chemistries. The work provides a new "building block" for querying databases and can be scaled up to very large databases. This approach can also fuse diverse types of data from different types of databases and can hence serve as a way to link our MaDE@UB system with other DIBBS efforts.

2. Molecular Systems Design and Engineering: This paper introduces a new approach to building computational materials science libraries that reflect fundamental electronic and crystal structure information. We show how, by applying computational physics methods to crystallographic data, we can produce a visualization scheme, called Hirshfeld surfaces, that captures the interaction between chemical bonding and crystal structure. These three-dimensional structures become the new chemical library on which we apply machine learning methods. Using a class of complex compounds known as metal-organic frameworks, we show how we can map and direct the chemical design of materials in a way that has not been done before.

3. Nature Scientific Reports: ML-based chart infographics tools developed under the grant allow us to automatically extract contours of phase boundaries and descriptive features of invariant points in phase diagrams, such as eutectics. As a case study, we have applied them to the problem of assessing the potential glass-forming ability of binary alloys. We show that such automation methods can guide us in identifying new metallic glass chemistries. As a next step, we are expanding this to a broad range of diagrams, such as TTT curves and polarization curves.


Take away points:

  • In all these cases, the major theme of our work is the discovery of new building blocks for materials science databases, which has a major impact on the domain science.
  • The introduction of topological data analysis (TDA) in materials science is very new, and we are one of only a handful of groups in the materials science arena exploring this approach. This approach, along with the use of manifold representations of databases, is domain agnostic and hence adds a new resource to the MaDE@UB toolkit.
  • While a number of groups apply text mining techniques to materials science papers to uncover interesting concepts, ours is the first to introduce the idea of automated learning and interpretation of scientific diagrams in materials science. This has ramifications in many directions for materials research, especially in harnessing the vast and mostly untapped amounts of engineering data, and the associated science, embedded in the published materials science literature.

References

  1. M. Haghighatlari, R. Subramanian, B. Urala Kota, G. Vishwakarma, A. Sonpal, P. H. Chen, S. Setlur, and J. Hachmann, “Chemml: A machine learning and informatics program suite for the chemical and materials sciences.” https://github.com/hachmannlab/chemml, 2018.

  2. M. Rupp, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld, “Fast and accurate modeling of molecular atomization energies with machine learning,” Phys. Rev. Lett., vol. 108, p. 058301, Jan 2012.

  3. K. Hansen, F. Biegler, R. Ramakrishnan, W. Pronobis, O. A. von Lilienfeld, K.-R. Müller, and A. Tkatchenko, “Machine learning predictions of molecular properties: Accurate many-body potentials and nonlocality in chemical space,” The Journal of Physical Chemistry Letters, vol. 6, no. 12, pp. 2326–2331, 2015. PMID: 26113956.

  4. L. Ward, A. Agrawal, A. Choudhary, and C. Wolverton, “A general-purpose machine learning framework for predicting properties of inorganic materials,” npj Computational Materials, vol. 2, p. 16028, aug 2016.

  5. W. McKinney, “Data structures for statistical computing in python,” in Proceedings of the 9th Python in Science Conference (S. van der Walt and J. Millman, eds.), pp. 51–56, 2010.

  6. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

  7. M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. J. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Józefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. G. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. A. Tucker, V. Vanhoucke, V. Vasudevan, F. B. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” CoRR, vol. abs/1603.04467, 2016.

  8. F. Chollet et al., “Keras.” https://keras.io, 2015.

  9. F.-A. Fortin, F.-M. De Rainville, M.-A. Gardner, M. Parizeau, and C. Gagné, “DEAP: Evolutionary algorithms made easy,” Journal of Machine Learning Research, vol. 13, pp. 2171–2175, jul 2012.

  10. J. D. Hunter, “Matplotlib: A 2d graphics environment,” Computing In Science & Engineering, vol. 9, no. 3, pp. 90–95, 2007.

  11. B. Meredig, A. Agrawal, S. Kirklin, J. E. Saal, J. W. Doak, A. Thompson, K. Zhang, A. Choudhary, and C. Wolverton, “Combinatorial screening for new materials in unconstrained composition space with machine learning,” Phys. Rev. B, vol. 89, p. 094104, Mar 2014.

  12. W. D. C. Jr. and D. G. Rethwisch, Materials Science and Engineering: An Introduction. Wiley, 2013.

  13. D. B. Miracle, E. A. Lord, and S. Ranganathan, “Candidate atomic cluster configurations in metallic glass structures,” MATERIALS TRANSACTIONS, vol. 47, no. 7, pp. 1737–1742, 2006.

  14. D. B. Miracle, D. V. Louzguine-Luzgin, L. V. Louzguina-Luzgina, and A. Inoue, “An assessment of binary metallic glasses: correlations between structure, glass forming ability and stability,” International Materials Reviews, vol. 55, pp. 218–256, jul 2010.

  15. K. J. Laws, D. B. Miracle, and M. Ferry, “A predictive structural model for bulk metallic glasses,” Nature Communications, vol. 6, sep 2015.

  16. X. Yang and Y. Zhang, “Prediction of high-entropy stabilized solid-solution in multi-component alloys,” Materials Chemistry and Physics, vol. 132, pp. 233–238, feb 2012.

  17. A. Takeuchi and A. Inoue, “Classification of bulk metallic glasses by atomic size difference, heat of mixing and period of constituent elements and its application to characterization of the main alloying element,” MATERIALS TRANSACTIONS, vol. 46, no. 12, pp. 2817–2829, 2005.

  18. M. Fernandez, N. R. Trefiak, and T. K. Woo, “Atomic property weighted radial distribution functions descriptors of metal–organic frameworks for the prediction of gas uptake capacity,” The Journal of Physical Chemistry C, vol. 117, no. 27, pp. 14095–14105, 2013.

  19. H. Wang, Q.-L. Zhu, R. Zou, and Q. Xu, “Metal-organic frameworks for energy applications,” Chem, vol. 2, pp. 52–80, jan 2017.

  20. K. T. Schütt, H. Glawe, F. Brockherde, A. Sanna, K. R. Müller, and E. K. U. Gross, “How to represent crystal structures for machine learning: Towards fast prediction of electronic properties,” Phys. Rev. B, vol. 89, p. 205118, May 2014.

  21. J. M. Cowley, “An approximate theory of order in alloys,” Physical Review, vol. 77, pp. 669–675, mar 1950.

  22. F. Faber, A. Lindmaa, O. A. von Lilienfeld, and R. Armiento, “Crystal structure representations for machine learning models of formation energies,” International Journal of Quantum Chemistry, vol. 115, pp. 1094–1101, apr 2015.

  23. L. Ward, R. Liu, A. Krishna, V. I. Hegde, A. Agrawal, A. Choudhary, and C. Wolverton, “Including crystal structure attributes in machine learning models of formation energies via voronoi tessellations,” Physical Review B, vol. 96, jul 2017.

  24. A. M. Deml, R. O’Hayre, C. Wolverton, and V. Stevanović, “Predicting density functional theory total energies and enthalpies of formation of metal-nonmetal compounds by linear regression,” Physical Review B, vol. 93, feb 2016.