Retaining Training Data Sets

As the use of Artificial Intelligence (AI) and machine learning methods expands in medical devices and HealthIT software, an often-asked question is whether the data sets used for training should be retained as part of the design history file (DHF) or in another long-term storage mechanism.  SoftwareCPR partners Alan Kusinitz, Sherman Eagles, John Murray, and Brian Pate recently met to discuss this topic and arrived at several guiding principles that may be useful to manufacturers as they consider specific policies for retaining training data sets.

We went into our roundtable discussion making the following assumptions:

  1. A trained model or algorithm represents a design output – produced by the activities and tasks of a development team and subject to Design Controls.
  2. Training set data is the means (or a portion of the overall means) by which the development team created the model or algorithm.
  3. Design input included the required patient population distribution (e.g., age range, sex, skin pigmentation), quantity (number of data items), minimum accuracy (e.g., sensitivity, specificity), and other user-controllable factors (e.g., for imaging, resolution).

One could consider model training to be a research activity and thus retain very little information in the DHF.  However, this would likely impede ongoing development and improvement of the medical device or HealthIT system, since the team would be “blind” to previous work.  This leads to the question:  what would a “new” development team need from the previous development team to orchestrate further development and improvements to the system?  This question illustrates one of the key purposes of a DHF.

Lean Product Development

If we approach the question from a lean product development viewpoint, we might re-frame the question as: what is the minimum amount of information a “new” development team would need from the previous development team to orchestrate further development and improvements to the system?  We considered this question at the roundtable and we arrived at this list:

  1. Source(s) of data items
  2. The number of data items
  3. How “ground truth” is annotated or associated with data items
  4. Patient population distribution
  5. Validation records

The assumption is that with this information, the manufacturer could create a new model or algorithm with performance equivalent to that of the original model, where equivalent performance is defined in design validation terms from the design input assumptions above.  By this argument, we could envision not retaining the actual training set data in its original form.
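As an illustration only, the five items in the list above could be captured as a small structured record kept in the DHF in place of the raw training data. The sketch below is a hypothetical Python example: the class name `TrainingDataSummary`, its field names, and the placeholder values are all assumptions made for illustration, not a SoftwareCPR template or a regulatory requirement.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TrainingDataSummary:
    """Hypothetical DHF record describing a training data set (not the data itself)."""
    sources: List[str] = field(default_factory=list)            # 1. source(s) of data items
    item_count: int = 0                                          # 2. number of data items
    ground_truth_method: str = ""                                # 3. how ground truth is annotated
    population_distribution: Dict[str, Dict[str, float]] = field(default_factory=dict)  # 4.
    validation_records: List[str] = field(default_factory=list)  # 5. references to validation records

    def missing_items(self) -> List[str]:
        """Return the names of the list items that have not been filled in."""
        missing = []
        if not self.sources:
            missing.append("sources")
        if self.item_count <= 0:
            missing.append("item_count")
        if not self.ground_truth_method.strip():
            missing.append("ground_truth_method")
        if not self.population_distribution:
            missing.append("population_distribution")
        if not self.validation_records:
            missing.append("validation_records")
        return missing


# Placeholder values only; none of these describe a real data set.
example = TrainingDataSummary(
    sources=["<data source name>"],
    item_count=1,
    ground_truth_method="<annotation procedure reference>",
    population_distribution={"sex": {"female": 0.5, "male": 0.5}},
    validation_records=["<validation report identifier>"],
)

print(example.missing_items())
```

A simple completeness check such as `missing_items()` could be run at a design review to confirm that all five items have been recorded before the original training data is released from storage.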

We hope this provides useful input to your planning for your AI/ML products.

About the author

Brian is a biomedical software engineer - whatever that is! Started writing machine code for the Intel 8080 in 1983. Still enjoys designing and developing code. But probably enjoys his garden more now and watching plants grow ... and grandkids grow!

SoftwareCPR Training Courses:

IEC 62304 and other emerging standards for Medical Device and HealthIT Software

Our flagship course for preparing regulatory, quality, engineering, operations, and other staff for the activities and documentation expected for IEC 62304 conformance and for FDA expectations. The goal is to educate on the intent and purpose so that the participants are able to make informed decisions in the future.  The focus is not simply on what the standard says, but on what is meant, with discussion of examples and approaches one might implement to comply.  Special deep discount pricing available to FDA attendees and other regulators.

3-days onsite with group exercises, quizzes, examples, Q&A.

Instructor: Brian Pate

Next public offering:  TBD

Email to request a special pre-registration discount.  Limited number of pre-registration coupons.

Being Agile & Yet Compliant (Public or Private)

Our training course on SoftwareCPR's unique approach to incorporating agile and lean engineering into your medical device software process is now open for scheduling!

  • Agile principles that align well with medical
  • Backlog management
  • Agile risk management
  • Incremental and iterative software development lifecycle management
  • Frequent release management
  • And more!

2-days onsite (4 days virtual) with group exercises, quizzes, examples, Q&A.

Instructors: Mike Russell, Ron Baerg

Next public offering: March 7 & 28, 2024

Virtual via Zoom

Medical Device Cybersecurity (Public or Private)

This course takes a deep dive into the US FDA expectations for cybersecurity activities in the product development process with central focus on the cybersecurity risk analysis process. Overall approach will be tied to relevant standards and FDA guidance documentation. The course will follow the ISO 14971:2019 framework for overall structure but utilize IEC 62304, IEC 81001-5-1, and AAMI TIR57 for specific details regarding cybersecurity planning, risk characterization, threat modeling, and control strategies.

2-days onsite with group exercises, quizzes, examples, Q&A.

Instructor: Dr Peter Rech, 2nd instructor (optional)

Next public offering:  TBD

Corporate Office

15148 Springview St.
Tampa, FL 33624
Partners located in the US (CA, FL, MA, MN, TX) and Canada.