Table of Contents
Definition / general | Essential features | Terminology | Applications | Limitations | Software | Videos | Additional references | Board review style question #1 | Board review style answer #1 | Board review style question #2 | Board review style answer #2
Cite this page: Gonzalez R, Norgan AP. Data repositories. PathologyOutlines.com website. https://www.pathologyoutlines.com/topic/informaticsdatarepositories.html. Accessed December 11th, 2024.
Definition / general
- Database infrastructure that compiles, manages and gives access to data and associated metadata and documentation (Ghent University: Using a Data Repository [Accessed 14 March 2024])
- Place to hold, organize logically and make data available (Harvard Medical School: Data Management Terminology [Accessed 14 March 2024])
- Tool to share, preserve and discover research outputs (NNLM: Data Glossary [Accessed 14 March 2024])
Essential features
- Most healthcare data are unstructured (e.g., freeform text and images) (Healthc Inform Res 2019;25:1)
- Database management system (DBMS) is software for creating and maintaining databases
- Traditional relational databases may be inadequate for applications and analyses of high volume healthcare data
- Data warehouses and data lakes are commonly used to improve the organization and accessibility of healthcare data
Terminology
- Data: refers to values; its meaning depends on the context (e.g., numbers, text, audio, images, video) (Frisse: Essentials of Clinical Informatics, Illustrated Edition, 2019, NNLM: Data Glossary [Accessed 14 March 2024])
- Data types
- Structured: follows a predefined schema (organization)
- Semistructured (i.e., schema-less or self describing data): contains elements of structured and unstructured data
- Unstructured: has no identifiable structure (Inform Med Unlocked 2023;39:101270)
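As a minimal sketch of the three data types above, the same laboratory result might be represented in Python as follows. The patient values, field names and the `has_schema` helper are hypothetical and chosen only for illustration.

```python
import json

# Structured: every record follows the same predefined schema (fixed fields, fixed order).
SCHEMA = ("patient_id", "test", "value", "unit")
structured_row = (101, "hemoglobin", 13.5, "g/dL")

# Semistructured: self describing; field names travel with the data and may vary per record.
semistructured = json.loads(
    '{"patient_id": 101, "test": "hemoglobin",'
    ' "result": {"value": 13.5, "unit": "g/dL"}, "flags": []}'
)

# Unstructured: freeform text with no identifiable structure.
unstructured = "Pt 101 seen today. Hgb was 13.5, within normal limits."


def has_schema(row, schema=SCHEMA):
    """Return True if the row is a tuple with exactly one value per schema field."""
    return isinstance(row, tuple) and len(row) == len(schema)


print(dict(zip(SCHEMA, structured_row)))
print(semistructured["result"]["unit"])
print(has_schema(structured_row), has_schema(unstructured))
```

Only the structured row can be checked against the schema; the freeform note would need text processing before its value could be compared with the other two.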
- Dataset: a collection of related data (NNLM: Data Glossary [Accessed 14 March 2024])
- Big data: refers to datasets that are too large to process on a personal computer or "high volume, high velocity or high variety information"; however, a precise definition is lacking (NNLM: Data Glossary [Accessed 14 March 2024], Gartner: Big Data [Accessed 14 March 2024], J Big Data 2022;9:3, Front Big Data 2023;6:1271639, Int J Med Inform 2018;114:57, PLoS One 2020;15:e0228987)
- Most healthcare big data are unstructured (e.g., freeform text from clinical notes and clinical images) (Healthc Inform Res 2019;25:1)
- Database (electronic or computerized database): a structured, organized collection of data stored and accessed electronically (NNLM: Data Glossary [Accessed 14 March 2024])
- Can have different structures (hierarchical, flat, object oriented, relational, graph and NoSQL databases) (see Database fundamentals)
- Database management system (DBMS): software used to create and maintain databases; it facilitates the following processes
- Database definition
- Describes the types, structures and constraints of the data to be stored (metadata)
- DBMS stores this information in the form of a database catalog or dictionary
- Database construction: storing the data on some controlled storage medium
- Database manipulation: this includes querying to retrieve data, updating the database and generating reports
- Database sharing: allowing multiple users and programs to access the database simultaneously
- A database can be accessed with an application program
- To retrieve data, it sends queries to the DBMS
- To read, insert, delete or update data, it uses transactions
- DBMS can be centralized (i.e., 1 information storage site) or distributed (i.e., many information storage sites connected by a computer network) (Elmasri: Fundamentals of Database Systems, 7th Edition, 2015)
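The DBMS processes listed above (definition, construction, manipulation and transactions) can be sketched with SQLite through Python's standard `sqlite3` module. SQLite is used here only as a convenient stand-in for a DBMS; the `specimen` table and its values are hypothetical.

```python
import sqlite3

# An in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")

# Database definition: declare types, structure and constraints (metadata).
conn.execute(
    "CREATE TABLE specimen ("
    " id INTEGER PRIMARY KEY,"
    " site TEXT NOT NULL,"
    " received TEXT NOT NULL)"
)

# The DBMS records the definition in its catalog (sqlite_master in SQLite).
catalog = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()

# Database construction inside a transaction:
# the context manager commits on success and rolls back on error.
with conn:
    conn.executemany(
        "INSERT INTO specimen (site, received) VALUES (?, ?)",
        [("colon", "2024-03-01"), ("skin", "2024-03-02")],
    )

# A failed transaction leaves the stored data unchanged (site is NOT NULL).
try:
    with conn:
        conn.execute(
            "INSERT INTO specimen (site, received) VALUES ('breast', '2024-03-03')"
        )
        conn.execute(
            "INSERT INTO specimen (site, received) VALUES (NULL, '2024-03-04')"
        )
except sqlite3.IntegrityError:
    pass

# Database manipulation: querying to retrieve data.
sites = [row[0] for row in conn.execute("SELECT site FROM specimen ORDER BY id")]
print(catalog, sites)
```

Because the second transaction violates a constraint, both of its inserts are rolled back and only the two committed rows remain.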
- Relational database management system (RDBMS)
- DBMS based on a relational data model (i.e., data stored as an ordered sequence of values [tuples] and information about the relationship between data elements)
- Structured query language (SQL) is the standard computer language to interact with RDBMS (Elmasri: Fundamentals of Database Systems, 7th Edition, 2015)
- Usually focused on data consistency and structured data storage (Elmasri: Fundamentals of Database Systems, 7th Edition, 2015)
- Cannot handle most current big data applications (Laurent: Data Lakes, 1st Edition, 2020)
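To illustrate the relational model, the following sketch stores tuples in 2 related tables and uses SQL to combine them. It again assumes SQLite via Python's `sqlite3` module; the `patient` and `biopsy` tables and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this setting is enabled.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript(
    """
    CREATE TABLE patient (
        patient_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL
    );
    CREATE TABLE biopsy (
        biopsy_id  INTEGER PRIMARY KEY,
        patient_id INTEGER NOT NULL REFERENCES patient(patient_id),
        diagnosis  TEXT NOT NULL
    );
    INSERT INTO patient VALUES (1, 'Patient A'), (2, 'Patient B');
    INSERT INTO biopsy VALUES
        (10, 1, 'tubular adenoma'),
        (11, 1, 'hyperplastic polyp'),
        (12, 2, 'melanocytic nevus');
    """
)

# Each result row is a tuple; the join follows the relationship between tables.
rows = conn.execute(
    "SELECT p.name, b.diagnosis"
    " FROM patient AS p JOIN biopsy AS b ON b.patient_id = p.patient_id"
    " WHERE p.patient_id = ? ORDER BY b.biopsy_id",
    (1,),
).fetchall()

# Data consistency: a biopsy that refers to a nonexistent patient is rejected.
try:
    conn.execute("INSERT INTO biopsy VALUES (13, 99, 'orphan record')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(rows, orphan_rejected)
```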
- NoSQL database management system (NoSQL DBMS)
- Can use various data models: document based (e.g., MongoDB and CouchDB), key value stores (e.g., DynamoDB), column based (e.g., Bigtable), graph based (e.g., Neo4j and GraphBase) or hybrid (such as Cassandra, which uses concepts from both key value stores and column based systems)
- Often massively distributed and focused on performance, data availability, replication and scalability
- Commonly employed for big data (Elmasri: Fundamentals of Database Systems, 7th Edition, 2015)
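The NoSQL data models named above can be contrasted with a toy sketch that uses only plain Python containers. It does not use or imitate the API of any of the listed products; the case identifiers and the `find` and `neighbors` helpers are hypothetical.

```python
# Key value model: the store only understands keys; values are opaque to it.
key_value = {"case:S24-001": '{"site": "colon", "blocks": 3}'}

# Document model: each document is self describing and may have different fields.
documents = [
    {"_id": "S24-001", "site": "colon", "stains": ["H&E"]},
    {"_id": "S24-002", "site": "skin", "stains": ["H&E", "SOX10"], "margins": "negative"},
]

# Graph model: nodes connected by labeled edges (relationships).
edges = [
    ("S24-001", "BELONGS_TO", "patient:1"),
    ("S24-002", "BELONGS_TO", "patient:1"),
]


def find(docs, **criteria):
    """Return documents whose fields match all criteria (absent fields never match)."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


def neighbors(edge_list, target, label):
    """Return nodes that point to target through an edge with the given label."""
    return [src for src, lbl, dst in edge_list if dst == target and lbl == label]


print(key_value["case:S24-001"])
print(find(documents, site="skin"))
print(neighbors(edges, "patient:1", "BELONGS_TO"))
```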
- Database system: composed of a database and a DBMS (Elmasri: Fundamentals of Database Systems, 7th Edition, 2015)
- Data warehouse: centralized data repository designed to support data analytics and business intelligence; it usually stores large amounts of historical data from different sources in an organized manner (Boddeda: Cloud Data Architectures Demystified, 2023)
- Fundamental features are (Frisse: Essentials of Clinical Informatics, Illustrated Edition, 2019)
- Subject oriented: designed with a potential question(s) in mind (Frisse: Essentials of Clinical Informatics, Illustrated Edition, 2019)
- Integrated: data extracted from different sources may need to be transformed (e.g., cleaned or reformatted) to make it comparable (Elmasri: Fundamentals of Database Systems, 7th Edition, 2015)
- Nonvolatile: loaded data should not be removed or modified (Frisse: Essentials of Clinical Informatics, Illustrated Edition, 2019)
- Time variant: data are stored for an extended time; historical data can be analyzed and used to produce trend reports (Frisse: Essentials of Clinical Informatics, Illustrated Edition, 2019, Domdouzis: Concise Guide to Databases - A Practical Introduction, 2nd Edition, 2021)
- Acronym ETL (extract, transform, load) describes how data are moved into a data warehouse (Frisse: Essentials of Clinical Informatics, Illustrated Edition, 2019)
- Tools used to handle these processes are known as ETL tools (Elmasri: Fundamentals of Database Systems, 7th Edition, 2015)
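A minimal ETL sketch, assuming 2 hypothetical source systems that report glucose in different units and an in-memory SQLite table standing in for the warehouse. The source records, function names and the rounded conversion factor are illustrative assumptions, not part of any specific ETL tool.

```python
import sqlite3

# Hypothetical source records from two systems with different conventions.
SOURCE_A = [{"mrn": "001", "glucose_mg_dl": "90"}, {"mrn": "002", "glucose_mg_dl": ""}]
SOURCE_B = [{"patient": "003", "glucose_mmol_l": 6.0}]

MG_DL_PER_MMOL_L = 18.0  # approximate conversion factor for glucose


def extract():
    """Extract: pull raw records from each source."""
    return list(SOURCE_A), list(SOURCE_B)


def transform(a_rows, b_rows):
    """Transform: clean the records and convert them to one comparable format (mg/dL)."""
    out = []
    for r in a_rows:
        if r["glucose_mg_dl"]:  # drop records with missing values
            out.append((r["mrn"], float(r["glucose_mg_dl"])))
    for r in b_rows:
        out.append((r["patient"], r["glucose_mmol_l"] * MG_DL_PER_MMOL_L))
    return out


def load(conn, rows):
    """Load: write the already transformed rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS glucose (mrn TEXT, mg_dl REAL)")
    with conn:
        conn.executemany("INSERT INTO glucose VALUES (?, ?)", rows)


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(*extract()))
result = warehouse.execute("SELECT mrn, mg_dl FROM glucose ORDER BY mrn").fetchall()
print(result)
```

The transformation happens before loading, so everything stored in the warehouse table is already in a single comparable unit.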
- Data lake: a set of centralized repositories for vast amounts of raw data (structured, semistructured or unstructured), organized into identifiable datasets, described by metadata and available on demand (Laurent: Data Lakes, 1st Edition, 2020)
- As opposed to data warehouses, data lakes
- Are not subject oriented (i.e., store data without a defined purpose) (Frisse: Essentials of Clinical Informatics, Illustrated Edition, 2019)
- Load data in their native format (i.e., store raw data) (Front Big Data 2022;5:945720)
- Transform data on demand (i.e., when needed) (Laurent: Data Lakes, 1st Edition, 2020)
- Acronym ELT (extract, load, transform) describes how data are moved into data lakes (Laurent: Data Lakes, 1st Edition, 2020)
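By contrast, an ELT flow can be sketched as loading raw payloads unchanged and interpreting them only when a question is asked. The in-memory `lake` list, the metadata fields and the `load_raw` and `read_glucose` helpers below are hypothetical stand-ins for real data lake storage and tooling.

```python
import json

lake = []  # stand-in for raw object storage


def load_raw(payload, source, fmt):
    """Load: store the payload in its native format, described by minimal metadata."""
    lake.append({"source": source, "format": fmt, "payload": payload})


def read_glucose(lake_items):
    """Transform on demand: parse raw payloads only when glucose values are requested."""
    values = []
    for item in lake_items:
        if item["format"] == "json":
            record = json.loads(item["payload"])
            if "glucose_mg_dl" in record:
                values.append(float(record["glucose_mg_dl"]))
        elif item["format"] == "csv":
            header, *lines = item["payload"].splitlines()
            idx = header.split(",").index("glucose_mg_dl")
            values.extend(float(line.split(",")[idx]) for line in lines)
        # Other formats (e.g., freeform text) stay in the lake untouched.
    return values


load_raw('{"mrn": "001", "glucose_mg_dl": "90"}', "lis", "json")
load_raw("mrn,glucose_mg_dl\n002,110\n003,95", "clinic_export", "csv")
load_raw("Pt reports fatigue; glucose not measured.", "notes", "text")

print(read_glucose(lake))
```

The raw payloads, including the unstructured note, remain available for later questions that were not anticipated at load time.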
- Other architectures
- Data marts: similar to data warehouses but with a restricted scope (Elmasri: Fundamentals of Database Systems, 7th Edition, 2015)
- Frequently used to provide targeted views of the data to specific groups of an organization (Boddeda: Cloud Data Architectures Demystified, 2023)
- Data lakehouse: a repository that stores structured and unstructured raw data (with the scalability and flexibility of a data lake) but includes built in features to analyze data in real time (with the performance and reliability of a data warehouse) (Boddeda: Cloud Data Architectures Demystified, 2023)
- Data hub: refers to a centralized point to manage data and provide access across the organization (Boddeda: Cloud Data Architectures Demystified, 2023)
- Data mesh: distributes data ownership and governance across different teams or domains; each of them is responsible for the quality and accessibility of their data (Boddeda: Cloud Data Architectures Demystified, 2023)
- Data fabric: provides a unified data view across different systems and applications; it enables data integration and management (Boddeda: Cloud Data Architectures Demystified, 2023)
Applications
- To store and manage digital pathology images and associated data so they can
- Be easily accessed, reviewed and shared with other pathologists
- Be utilized for research (see Digital imaging fundamentals & standards) or educational purposes (see Education)
- To store other laboratory data
- For billing and quality control purposes (see Database fundamentals) (Semin Diagn Pathol 2019;36:294)
- To integrate laboratory data with other healthcare data
- For data analytics and artificial intelligence (AI) based decision support tools (Nat Methods 2023;20:475, Trends Cancer 2024;10:147, J Pathol Inform 2023;15:100347)
Limitations
- Integrating data from heterogeneous sources, achieving regulatory compliance and protecting data quality and privacy can be challenging (Big Data Cogn Comput 2022;6:132, Inform Med Unlocked 2023;39:101270)
- Storing large amounts of data may not be affordable or profitable for some institutions (J Clin Pathol 2021;74:409)
Software
- Hadoop and Spark are common open source big data infrastructure frameworks to store and process large datasets (Big Data Cogn Comput 2022;6:132)
- Popular data warehousing tools and services include (Big Data Cogn Comput 2022;6:132)
- Amazon Web Services (AWS)
- Microsoft Azure
- Oracle
- Snowflake platform
- IBM
- Popular data lake / lakehouse tools and services are (Big Data Cogn Comput 2022;6:132)
- Other relevant software can be found at
Videos
Data management playlist - IBM Technology
Data storage essentials playlist - IBM Technology
Additional references
- PathLAKE: Pathology Image Data Lake for Analytics Knowledge & Education [Accessed 14 March 2024], Memorial Sloan Kettering Library: Artificial Intelligence - Datasets and Repositories [Accessed 14 March 2024], JMIR Form Res 2020;4:e17687, Mayo Clinic Platform: Products & Services [Accessed 14 March 2024], Springer Nature: Data Repository Guidance [Accessed 14 March 2024]
Board review style question #1
Which acronym best describes how data are moved into a data warehouse?
- A. ELT (extract, load, transform)
- B. ETL (extract, transform, load)
- C. ITE (import, transform, export)
- D. LTE (load, transform, export)
Board review style answer #1
B. ETL (extract, transform, load). Data are obtained from 1 or more sources following specific criteria in the first phase of this process (i.e., extract). Then, data are cleaned and converted into specific formats to be stored (i.e., transform). This can make data from different sources comparable. Later, data are uploaded to the warehouse (i.e., load) (Frisse: Essentials of Clinical Informatics, Illustrated Edition, 2019). Answer A is incorrect because data are transformed before loading in data warehouses (as opposed to data lakes, in which data are transformed on demand after loading). Answers C and D are incorrect because export and import are not part of the terms referred to in this acronym.
Reference: Data repositories
Board review style question #2
Which of the following statements is true about data lakes?
- A. Are subject oriented
- B. Cannot be used with healthcare data
- C. Cannot store raw data
- D. Can store unstructured data
Board review style answer #2
D. Can store unstructured data. Data lakes can store structured, semistructured and unstructured data. Answers A and C are incorrect because unlike data warehouses, data lakes are not designed with a potential question(s) in mind (i.e., they are not subject oriented) and can store data in their native form (i.e., raw data). Answer B is incorrect because data lakes can be used to improve the organization and accessibility of healthcare data.
Reference: Data repositories