
Principles and Metrics for Curating Large Engineering Simulation Data Sets for ML

Jami J. Shah, Satchit Ramnath, Stefan Menzel, Thiago de Jesus de Araujo Rios, Fatma Kocer, Eamon Whalen, Joseph Pajot, Alex Adrian, Prakash Kumar, "Principles and Metrics for Curating Large Engineering Simulation Data Sets for ML", ASME Journal of Computing and Information Science in Engineering (JCISE), 2025.

Abstract

It is time to talk about data in its own right, not just its usage! Machine learning applications use a wide variety of data sources, some real, such as data collected by sensors and cameras in driving, and some artificial, such as data generated through numerical simulations. The latter mode has been rapidly gaining popularity for engineering design and analysis. Early work in this arena seemed to center on data generated by the developers of ML applications themselves. While the need for confidentiality of proprietary data may continue to drive this trend, we are seeing the beginnings of publicly shared data sets. Thus, the quality and efficacy of such data need to be considered before their use. In this paper, we attempt to outline systematic principles and quantifiable quality and efficacy metrics based on insights gained collectively from both data curation projects and usage of large engineering data sets. Specifically, this paper addresses issues related to generating big data sets from CAD and FEA: Granularity and Modality, applicable to both input and output data; Variety and Balance, applicable to input data; and Efficacy for ML. Generation of big data by simulation requires the use of commercial CAD and FE packages, which poses multiple challenges: automation, integration, and balancing of sample variants. This paper proposes metrics for evaluating data efficacy: size, quality, validity, and balance. We propose Parametric Variety and Balance Matrices to study over- or under-representation of input attributes. Even a large data set with good balance may not be suitable for ML if it does not show significant variation in the response or performance variables. Several case studies in data generation, curation, and utilization are included across a variety of domains (aero, structural, thermal, manufacturing), and use of the metrics is demonstrated.
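
The abstract's two closing concerns, balance of input attributes and variation in the response, can be illustrated with a minimal Python sketch. This is not the paper's Parametric Variety and Balance Matrices, whose definitions are not given here; the function names `balance_score` and `response_variation`, the choice of normalized Shannon entropy over equal-width bins, and the coefficient of variation are all assumptions made purely for illustration.

```python
import math
from collections import Counter


def balance_score(values, n_bins=5):
    """Hypothetical balance indicator for one input parameter.

    Bins the samples into n_bins equal-width bins and returns the Shannon
    entropy of the bin occupancy, normalized so 1.0 means perfectly even.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a single repeated value has no variety
    width = (hi - lo) / n_bins
    # Clamp so the maximum value falls into the last bin.
    bins = Counter(min(int((v - lo) / width), n_bins - 1) for v in values)
    n = len(values)
    entropy = -sum((c / n) * math.log(c / n) for c in bins.values())
    return entropy / math.log(n_bins)


def response_variation(responses):
    """Hypothetical response-variation indicator: population std / |mean|."""
    n = len(responses)
    mean = sum(responses) / n
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for zero mean")
    variance = sum((r - mean) ** 2 for r in responses) / n
    return math.sqrt(variance) / abs(mean)
```

Under these assumptions, an evenly sampled parameter scores near 1.0 while a heavily skewed one scores much lower, and a response that barely changes across samples yields a variation near zero, the situation the abstract flags as unsuitable for ML even when the inputs are well balanced.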



