Retaining Training Data Sets

As the use of Artificial Intelligence (AI) and machine learning methods expands in medical devices and HealthIT software, a frequently asked question is whether the data sets used for training should be retained as part of the design history file (DHF) or in other long-term storage. SoftwareCPR partners Alan Kusinitz, Sherman Eagles, John Murray, and Brian Pate recently met to discuss this topic and arrived at several guiding principles that may be useful to manufacturers as they consider a specific policy on retaining training data sets.

We went into our roundtable discussion making the following assumptions:

  1. A trained model or algorithm represents a design output – produced by the activities and tasks of a development team and subject to Design Controls.
  2. Training set data is the means (or a portion of the overall method) by which the development team created the model or algorithm.
  3. Design input included the required patient population distribution (e.g., age range, sex, skin pigmentation) and quantity (number of data items), minimum accuracy (e.g., sensitivity, specificity), and other user-controllable factors (e.g., image resolution, if imaging).

One could consider model training to be a research activity and thus retain very little information in the DHF. However, this would likely impede ongoing development and improvement of the medical device or HealthIT system, since the team would be “blind” to previous work. This leads to the question: what would a “new” development team need from the previous development team to carry out further development and improvements to the system? Answering this question is precisely one of the key purposes of a DHF.

Lean Product Development

If we approach the question from a lean product development viewpoint, we might re-frame the question as: what is the minimum amount of information a “new” development team would need from the previous development team to orchestrate further development and improvements to the system?  We considered this question at the roundtable and we arrived at this list:

  1. Source(s) of data items
  2. The number of data items
  3. How “ground truth” is annotated or associated with data items
  4. Patient population distribution
  5. Validation records

The assumption is that, with this information, the manufacturer could create a new model or algorithm whose performance is equivalent to that of the original, where equivalent performance is defined in design validation terms derived from the design input assumptions above. By this argument, we could envision not retaining the actual training set data in its original form.
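As an illustration only, the five items listed above could be captured as a small structured record kept in the DHF in place of the raw data. The sketch below is a minimal example under stated assumptions: the class name `TrainingSetRecord`, its field names, and the sample values are all hypothetical and are not drawn from any regulation, standard, or SoftwareCPR template.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TrainingSetRecord:
    """Hypothetical DHF summary of a training set; it does not hold the data itself."""

    sources: list[str]                     # 1. source(s) of data items
    item_count: int                        # 2. number of data items
    ground_truth_method: str               # 3. how ground truth was annotated or associated
    population: dict[str, dict[str, int]]  # 4. patient population distribution (counts per category)
    validation_records: list[str] = field(default_factory=list)  # 5. references to validation records

    def missing_items(self) -> list[str]:
        """Return the names of any of the five items that are still empty."""
        gaps = []
        if not self.sources:
            gaps.append("sources")
        if self.item_count <= 0:
            gaps.append("item_count")
        if not self.ground_truth_method.strip():
            gaps.append("ground_truth_method")
        if not self.population:
            gaps.append("population")
        if not self.validation_records:
            gaps.append("validation_records")
        return gaps


# Example values are invented for illustration.
record = TrainingSetRecord(
    sources=["Site A retrospective image archive (hypothetical)"],
    item_count=1000,
    ground_truth_method="Consensus of two qualified readers (hypothetical)",
    population={"sex": {"female": 520, "male": 480}},
)
print(record.missing_items())  # prints ['validation_records']
```

A completeness check of this kind says nothing about whether the recorded information is sufficient to reproduce equivalent performance; that judgment still rests on design validation against the design inputs.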

We hope this provides useful input to your planning for your AI/ML products.

About the author

Partner and General Manager, Brian Pate is ISO 13485:2016 Lead Auditor certified for Medical Device Quality Management Systems (MD), and ISO 19011:2018 Management Systems Auditing (AU) and Leading Management Systems Audit Teams (TL). Brian started his medical device career in anesthesia clinical research in 1985 and has since worked in both academia and industry, including many years with Johnson & Johnson, Baxter Healthcare, and GE Medical. Brian’s roles have included software engineering, systems engineering, quality assurance, and regulatory affairs. Brian has served on multiple AAMI TIR working groups, including TIR32-2008 (Application of ISO 14971 Risk Management to Software; now IEC 80002-1) and TIR45-2012 (Guidance on the use of Agile practices in the development of medical device software), and served as a reviewer for the 2nd edition of TIR45. Brian serves on the AAMI Software Committee and as an AAMI instructor for the software, design controls, and agile methods courses. Brian is also a member of the Underwriters’ Laboratories (UL) Standards Technical Panel for UL1998 (Software in Programmable Components) and for UL5500 (Remote Software Updates).
