This thesis explores the application of BioBERT, a BERT model pretrained on biomedical text, to the identification of polyphenol and food entities in scientific literature. The primary objective is to automate the extraction of polyphenol content information from large collections of scientific texts using Named Entity Recognition (NER), thereby addressing the challenge of managing the growing volume of scientific data. By applying current deep learning and natural language processing methods, the study aims to improve the accuracy and efficiency of data extraction, contributing to nutritional research and supporting the maintenance of comprehensive nutritional databases such as Phenol-Explorer.
The methodology involved collecting a diverse dataset of scientific papers on polyphenol content in foods and preprocessing it for training and evaluation. A deep learning model was then designed and trained to extract polyphenol and food entities. Initial experiments without optimization revealed significant overfitting, which was mitigated by adopting the AdamW optimizer. Data augmentation further improved the model’s robustness by increasing the diversity of training samples, making it more resilient to noisy inputs and minor textual variations.
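As a rough illustration of why AdamW can help with overfitting, the sketch below implements a single AdamW update in plain Python. AdamW differs from Adam with L2 regularization in that weight decay is applied directly to the weights rather than being added to the gradient. The function name `adamw_step` and all hyperparameter values are illustrative assumptions, not the configuration used in this thesis.

```python
import math


def adamw_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """Apply one AdamW update to flat lists of floats.

    Hypothetical helper for illustration only; hyperparameters are assumed
    defaults, not values reported in the thesis. `t` is the 1-based step count.
    Returns new (params, m, v) lists.
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # Decoupled weight decay: shrink the weight directly,
        # independently of the gradient-based update.
        p = p * (1.0 - lr * weight_decay)
        # Exponential moving averages of the gradient and its square.
        m_i = beta1 * m_i + (1.0 - beta1) * g
        v_i = beta2 * v_i + (1.0 - beta2) * g * g
        # Bias correction for the zero-initialized moving averages.
        m_hat = m_i / (1.0 - beta1 ** t)
        v_hat = v_i / (1.0 - beta2 ** t)
        # Adam-style step using the corrected moments.
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        new_params.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v
```

With a zero gradient the weights still shrink slightly at every step, which is the regularizing effect of the decoupled decay term.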
The results show that the BioBERT-based NER model detects polyphenol and food entities effectively, achieving precision, recall and F1-scores of around 0.98 and an overall accuracy of 0.99. Adopting the AdamW optimizer significantly reduced overfitting, as evidenced by the consistent decrease in validation loss across trials. These outcomes indicate that the model generalizes well to unseen data and is robust and adaptable in handling complex biomedical information.
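For readers unfamiliar with how precision, recall and F1 are commonly computed for NER, the following minimal sketch scores predictions at the entity level. It assumes a BIO tagging scheme and the hypothetical labels `POLYPHENOL` and `FOOD`; the abstract does not state which tagging scheme or evaluation granularity (token-level or entity-level) the thesis used.

```python
def extract_entities(tags):
    """Return a set of (start, end, type) spans from a BIO tag sequence."""
    entities = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any entity that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            entities.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


def precision_recall_f1(gold_tags, pred_tags):
    """Entity-level scores: a prediction counts only on an exact span and type match."""
    gold = extract_entities(gold_tags)
    pred = extract_entities(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative example (made-up tags, not thesis data):
# the model finds the polyphenol mention but misses the food mention.
gold = ["B-POLYPHENOL", "I-POLYPHENOL", "O", "B-FOOD"]
pred = ["B-POLYPHENOL", "I-POLYPHENOL", "O", "O"]
precision, recall, f1 = precision_recall_f1(gold, pred)
```

In this example one of the two gold entities is recovered and nothing spurious is predicted, so precision is 1.0 and recall is 0.5.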
The contributions of this research extend beyond technical achievements. By automating the extraction of polyphenol data, the study addresses the ongoing challenge of data overload in scientific research, ensuring that critical insights are efficiently extracted and used. The establishment of a continuous update mechanism for the Phenol-Explorer database supports the maintenance of accurate and relevant nutritional data, thereby enhancing the impact of polyphenol research on health and nutrition science.
In conclusion, this thesis not only achieves its initial objectives but also lays a solid foundation for future work on applying artificial intelligence to nutritional research. The integration of BioBERT within a rigorous scientific methodology illustrates the potential of AI to advance our understanding of nutritional sciences and to support evidence-based practices. Future research directions include expanding the dataset, incorporating relation extraction methods to identify interactions between polyphenols and foods, and quantifying polyphenol concentrations in foods. These extensions would enable more detailed and automated information extraction from scientific papers, contributing to nutritional research and to the broader application of AI in managing and analysing scientific data.