What's in a MalaCard?


This page provides information about the various MalaCards sections and tables.

 

MalaCards Disease List 

An offline process is responsible for generating the comprehensive integrated list of diseases by mining heterogeneous, partially overlapping sources (see below for list of sources), unifying names and acronyms, and organizing characterizations.

Disease name unification is effected by transforming each name to a canonical form. The canonical form is constructed by lowercasing, lexically sorting the words, removing special characters and common words (such as 'disease' and 'syndrome'), and merging equivalent words (such as 'juvenile' and 'childhood'). This canonical form is then hashed, and the hash is compared against those of newly transformed names.
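The canonicalization steps described above can be sketched as follows. This is a minimal illustration, not the actual MalaCards code: the word lists, the function names and the choice of SHA-1 are assumptions, since this page does not specify them.

```python
import hashlib
import re

# Hypothetical word lists; the real lists are not given on this page.
COMMON_WORDS = {"disease", "syndrome"}
EQUIVALENT_WORDS = {"childhood": "juvenile"}


def canonical_form(name):
    """Lowercase, strip special characters, merge equivalents, drop common words, sort."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", name.lower())
    words = [EQUIVALENT_WORDS.get(word, word) for word in cleaned.split()]
    words = [word for word in words if word not in COMMON_WORDS]
    return " ".join(sorted(words))


def canonical_hash(name):
    """Hash of the canonical form (SHA-1 is an assumption), used as a comparison key."""
    return hashlib.sha1(canonical_form(name).encode("utf-8")).hexdigest()


# Two differently written names that reduce to the same canonical form.
known_hashes = {canonical_hash("Juvenile Arthritis")}
print(canonical_hash("Arthritis, Childhood") in known_hashes)  # True
```

Storing hashes rather than raw names means a new name can be checked against the existing list with a single set lookup.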

For each malady, a unique symbol is generated, composed of the first letter of its name, followed by the next two consonants, followed by a serial number. For example, the symbol generated for Rett syndrome is RTT001.
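A minimal sketch of this symbol scheme is shown below. It assumes that 'y' counts as a consonant, that the serial number is kept per three-letter prefix, and that it is zero-padded to three digits as in the RTT001 example. The class and function names are hypothetical.

```python
from collections import defaultdict

VOWELS = set("aeiou")


def symbol_prefix(name):
    """First letter of the name followed by the next two consonants, uppercased."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        raise ValueError("name contains no letters")
    consonants = [ch for ch in letters[1:] if ch not in VOWELS]
    return (letters[0] + "".join(consonants[:2])).upper()


class SymbolGenerator:
    """Hands out symbols, keeping one serial counter per prefix (an assumption)."""

    def __init__(self):
        self._counters = defaultdict(int)

    def new_symbol(self, name):
        prefix = symbol_prefix(name)
        self._counters[prefix] += 1
        return f"{prefix}{self._counters[prefix]:03d}"


generator = SymbolGenerator()
print(generator.new_symbol("rett syndrome"))  # RTT001
```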

 

MalaCards Header 

This section provides the malady name, symbol and acronym (where available).

A stats bar provides some statistics related to the information shown in the card.

 

Annotation schemes

MalaCards employs four different annotation schemes, as follows:

  • Source mining- Mining data sources for disease-specific information is used to populate relevant sections of a MalaCard. To this end, we define two types of sources. Primary sources are those which are used to derive both disease names and annotations. Secondary sources are those from which only annotations are derived. These sources generally contain non-disease terms intermixed with disease information. Direct source mining provides information for the Aliases & descriptions, summaries, clinical features, drugs and therapeutics, genetic tests and anatomical context sections. When appropriate, in-house analysis is performed in order to link annotations to diseases, or to integrate and display disease-specific data. For example, we have developed a process that utilizes UMLS concepts to map diseases to the drugs used for their treatment (see 'MalaCards sections').
  • GeneCards search- One central annotation source for MalaCards is the automated use of the GeneCards search engine, including section-specific advanced searches. For example, the gene set affiliated with a disease is obtained by using the disease name as a search string, which allows the generation of the related genes section in MalaCards. Importantly, gene association does not imply causality between the gene and the disease. Associations sometimes include annotations such as 'unaffected', and this can be verified using the 'GeneCards section context' link. Similarly, the publications associated with a disease are obtained via a search for its name in all of the publication titles within GeneCards.
  • GeneDecks set analyses- MalaCards implements a strategy in which gene-disease relationships within GeneCards are used to create disease-specific content. For this, we leverage GeneCards' GeneDecks tool, in its Set Distiller mode. The disease-associated gene set (generated as described above) is forwarded to GeneDecks, which distills statistically significant descriptors enriched in this set. For example, in the 'Atherosclerosis' MalaCard, 'cardiovascular system' is thus entered into the phenotypes section, while 'apoptosis' is entered into the pathways section. This process also assigns a relevance score for every hit, and is employed to populate the related diseases, phenotypes, pathways, compounds and GO terms sections. In these sections, the relevant tables display the top affiliating genes, linked to their respective contexts within GeneCards.
  • MalaCards search- We use MalaCards searches to populate additional sections, including elucidating new relations amongst diseases in the related diseases section and associating tissues in the anatomical context section.
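The idea of distilling descriptors enriched in a disease gene set, as described for the GeneDecks set analyses, can be illustrated with a hypergeometric over-representation test. This is a common choice for gene-set enrichment, but this page does not state which statistic Set Distiller uses, so the test, the 0.05 threshold, the function names and the gene and descriptor names below are all assumptions.

```python
from math import comb


def hypergeom_pvalue(population, annotated, sample, overlap):
    """P(X >= overlap) when `sample` genes are drawn from `population`,
    of which `annotated` carry the descriptor."""
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / comb(population, sample)


def distill(disease_genes, descriptor_genes, population, alpha=0.05):
    """Return (descriptor, p-value, shared genes) for enriched descriptors,
    ordered by p-value and then by number of shared genes."""
    hits = []
    for descriptor, genes in descriptor_genes.items():
        shared = disease_genes & genes
        if not shared:
            continue
        p = hypergeom_pvalue(population, len(genes), len(disease_genes), len(shared))
        if p < alpha:
            hits.append((descriptor, p, sorted(shared)))
    return sorted(hits, key=lambda hit: (hit[1], -len(hit[2])))


# Hypothetical gene symbols and descriptors, for illustration only.
disease_genes = {"G1", "G2", "G3", "G4"}
descriptor_genes = {
    "descriptor A": {"G1", "G2", "G3", "G9", "G10"},       # small, mostly shared
    "descriptor B": {f"G{i}" for i in range(4, 504)},      # large, one shared gene
}
for descriptor, p, shared in distill(disease_genes, descriptor_genes, population=1000):
    print(descriptor, p, shared)
```

In this toy data only 'descriptor A' passes the threshold: three of its five genes are in the disease set, whereas 'descriptor B' covers half of the population and shares a single gene.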

MalaCards Scores 

  1. MalaCards InFormaTion Score (MIFTS). Assigned to each disease by summing the base 10 logarithms of the counts of its populated annotations. MIFTS defines the richness of information in each card. This score currently ranges from 1 to 101, with the MalaCard scored 101 being the most annotated card.

  2. MalaCards composite relevance score (MCRS). Assigned to descriptors provided by the GeneDecks set analyses mechanism.
    The score is defined as:

    [Formula image: MalaCards composite relevance score (MCRS)]

    where:

    SGD is the rank of the GeneDecks score, which orders descriptors first by their GeneDecks p-value, and then by the size of the group of genes associated with the descriptor. SLR(i) is the rank, by Solr search engine score, of a gene shared between the descriptor and the disease. Ns is the number of data sources supporting the descriptor. Thus, the score takes into account the importance of the hit in GeneCards, the significance of the specific attribute according to GeneDecks, and the number of supporting sources.
  3. GeneCards search relevance score (GSRS). Obtained by the Solr-based GeneCards search engine. This relevance score takes into account the number of hits, and the importance of the fields in which they were found.

  4. MalaCards search relevance score (MSRS). Obtained by the Solr-based MalaCards search engine, as described in the MalaCards search guide.

  5. MalaCards composite related diseases score (MCRDS). Assigned to entries in the related diseases section. It is computed as the sum of 1) MalaCards composite relevance score and 2) MalaCards search relevance score. Prior to this, each of these two score values is normalized by equating the means as well as the standard deviations for the two distributions across all of MalaCards. A bonus amounting to the average of the two scores is added to diseases coming from both GeneDecks set analysis and MalaCards search.
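The MIFTS and MCRDS descriptions above can be sketched as follows. This is one possible reading, with several assumptions: MIFTS is taken as the raw sum of log10 counts (the stated range of 1 to 101 may involve scaling that this page does not describe); the mean and standard deviation are equated by linearly rescaling one score distribution onto the other; and a disease found by only one mechanism simply keeps its single normalized score. All function names, counts and scores are hypothetical.

```python
from math import log10
from statistics import mean, pstdev


def mifts(annotation_counts):
    """Sum of log10(count) over populated annotation types (count > 0)."""
    return sum(log10(count) for count in annotation_counts.values() if count > 0)


def rescale_to_reference(scores, reference):
    """Linearly rescale `scores` so its mean and standard deviation match `reference`.
    In MalaCards the statistics are taken across all cards; here they come
    from the small inputs given."""
    ref_mu, ref_sigma = mean(reference), pstdev(reference)
    values = list(scores.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {key: ref_mu for key in scores}
    return {key: (value - mu) / sigma * ref_sigma + ref_mu for key, value in scores.items()}


def mcrds(mcrs, msrs):
    """Combine normalized MCRS and MSRS values; entries found by both get a bonus."""
    combined = {}
    for disease in sorted(set(mcrs) | set(msrs)):
        a, b = mcrs.get(disease), msrs.get(disease)
        if a is not None and b is not None:
            combined[disease] = a + b + (a + b) / 2  # sum plus the average as bonus
        else:
            combined[disease] = a if a is not None else b
    return combined


# Hypothetical counts and scores, for illustration only.
print(mifts({"genes": 100, "drugs": 10, "pathways": 0}))
mcrs_scores = {"disease A": 1.0, "disease B": 3.0}
msrs_scores = rescale_to_reference(
    {"disease B": 10.0, "disease C": 30.0}, list(mcrs_scores.values())
)
print(mcrds(mcrs_scores, msrs_scores))
```

Note that an annotation type with exactly one entry contributes log10(1) = 0 to the sum under this reading.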

Disease Family Classification

In order to start adding structure and hierarchy into MalaCards, disease types were clustered into families. ~3000 diseases are currently distributed amongst ~700 families. This grouping was done on a lexical basis, looking for names having the same base but distinguished by their type. For example, a search for 'Alzheimer' retrieves over 1500 MalaCards ranked by a score. By perusing the list, we observe that the top 19 hits actually describe 2 types of Alzheimer disease, the general type and the familial type. The disease in each 'family' having the highest MIFTS score is designated to be the 'parent'. In the case of a tie, priority is given to the disease associated with the highest number of sources, and then to the one with the shortest name. The parent/child attribute is denoted by 'P' or 'c' respectively and appears in search results, as well as at the top of the related diseases section of relevant cards.
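The parent selection rule above (highest MIFTS, then most sources, then shortest name) maps naturally onto a composite sort key. The sketch below uses a hypothetical Disease record with made-up scores and source counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Disease:
    name: str
    mifts: float
    source_count: int


def pick_parent(family):
    """Highest MIFTS wins; ties go to more sources, then to the shorter name."""
    return min(family, key=lambda d: (-d.mifts, -d.source_count, len(d.name)))


def label_family(family):
    """Map each disease name to 'P' (parent) or 'c' (child)."""
    parent = pick_parent(family)
    return {d.name: ("P" if d == parent else "c") for d in family}


# Hypothetical values, for illustration only.
family = [
    Disease("alzheimer disease, familial", mifts=50.0, source_count=12),
    Disease("alzheimer disease", mifts=50.0, source_count=12),
    Disease("alzheimer disease 2", mifts=40.0, source_count=20),
]
print(label_family(family))
```

Here the first two entries tie on both MIFTS and source count, so the shorter name becomes the parent.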

Card Export

A user can download the card data to a parsable Excel sheet using the 'Export this MalaCard' button on the left-hand side of the summaries section. Data for scientific collaborations can also be requested by filling out the Feedback Form.

MalaCards Sections

Summaries

This section displays descriptions of a disease, as extracted from a subset of the sources listed below, as well as a MalaCards unique summary describing the card content. Other summaries typically include a short definition of the disease, organs involved, etiology and main symptoms.

The MalaCards-generated summary groups the main annotations in the specific card into a descriptive text.

Aliases & Descriptions 

This section displays synonyms and aliases for the relevant MalaCards malady, as extracted from a subset of the sources listed below. Strongly similar aliases, even if trivially different, are included, to match common expectations and to facilitate searches. The disease name appears first, with its own associated source-indicating superscripts. The alias list is sorted first by the count of contributing sources, sub-sorted by descending length.
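The alias ordering described above can be expressed as a two-part sort key. The sketch assumes that aliases with more contributing sources come first (the page does not state the direction for the source count); the alias and source names are hypothetical.

```python
def sort_aliases(alias_sources):
    """Order aliases by number of contributing sources (descending, an assumption),
    then by descending alias length."""
    return sorted(
        alias_sources,
        key=lambda alias: (-len(alias_sources[alias]), -len(alias)),
    )


# Hypothetical aliases mapped to the sources that report them.
alias_sources = {
    "short alias": {"source_a"},
    "a much longer alias": {"source_a"},
    "common alias": {"source_a", "source_b", "source_c"},
}
print(sort_aliases(alias_sources))
# ['common alias', 'a much longer alias', 'short alias']
```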

External IDs, which are cross-references to entries in external databases and ontologies, are also presented where available. The external IDs are searchable.

Related Diseases

The top of the section displays the disease family classification if available.

Related diseases are obtained in two ways: First, by GeneDecks set analysis, whereby other diseases computed to have significant shared descriptors for the target disease's affiliated genes are collected. Second, as matched by MalaCards searches. All obtained related diseases are sorted by the MalaCards composite relevance score.

Network images are generated using the Gephi toolkit. Images are generated for the top 20 scored related diseases. Each related disease is a node, while edges represent a connection to the target disease, as well as interconnections between the related diseases themselves, where available. Images are not generated for diseases having fewer than 8 connections. Currently, node distances from the MalaCard's disease (shown as a red filled circle, with its edges colored red) are not significant.
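The node and edge data behind such a network image can be sketched without the Gephi toolkit itself. The function below is a hypothetical illustration; in particular, it assumes that 'connections' means the total number of edges in the network, which this page does not specify.

```python
def build_network(target, related_scores, links, top_n=20, min_connections=8):
    """Return nodes and edges for the top-scored related diseases,
    or None when there are too few connections to draw an image."""
    top = sorted(related_scores, key=related_scores.get, reverse=True)[:top_n]
    top_set = set(top)
    edges = [(target, disease) for disease in top]  # edges to the target disease
    # Interconnections among the related diseases themselves, where available.
    edges += sorted((a, b) for a, b in links if a in top_set and b in top_set)
    if len(edges) < min_connections:
        return None
    return {"nodes": [target] + top, "edges": edges}


# Hypothetical related diseases D0..D24 with descending scores.
related_scores = {f"D{i}": 25 - i for i in range(25)}
links = {("D0", "D1"), ("D0", "D24")}  # D24 falls outside the top 20
network = build_network("target disease", related_scores, links)
print(len(network["nodes"]), len(network["edges"]))  # 21 21
```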

Clinical Features

Provides information and links about symptoms and other clinical attributes of the disease, extracted from a subset of the sources listed below. Symptoms typically represent changes from normal function, sensation, or appearance, but may also be other MalaCards maladies with their own cards.

Drugs & Therapeutics

This section provides information regarding:

  • Approved drugs- deep link for search in CenterWatch for newly approved drugs.
  • Clinical trials:
    1. deep link for search in CenterWatch for clinical trials.
    2. deep link for search in ClinicalTrials.gov for clinical trials.
    3. deep link for search in NIH Clinical Center for clinical research studies.
  • Inferred drug relations via UMLS/NDF-RT- Combined information from the Unified Medical Language System (UMLS) and the National Drug File-Reference Terminology (NDF-RT). Initially, a MalaCards name is mapped to a UMLS concept representing a disease by utilizing the MetaMap system. Subsequently, the NDF-RT terminology within UMLS is used to provide a link of such disease concepts to drug(s) via the 'may be treated by' relationship. This work was done in collaboration with C. Paul Morrey.
  • Cell-based therapeutics approaches from Lifemap Discovery:
    1. Stem-cell-based therapeutic approaches
    2. Embryonic/adult cultured cells which are candidate therapeutic approaches

Genetic Tests

This section provides descriptions of genetic testing, specialized cytogenetic testing, and biochemical testing for inherited disorders. Genetic tests are extracted from a subset of the sources listed below. The section shows both clinical and research laboratories performing genetic tests.

Anatomical Context

This section provides descriptions of cells, compartments, and organs relevant to the disease. Anatomical context data is extracted from a subset of the sources listed below. The MalaCards organs/tissues related to the disease are obtained by using the MalaCards search mechanism on a set of predefined tissues. Foundational Model of Anatomy (FMA) ontology data interconnections are extracted via the Disease Ontology.

Phenotypes

This section provides mouse ortholog phenotypes, which are obtained by being contextually related to the key disease using the GeneDecks mechanism described above, applied to the set of affiliated genes. Phenotypes are scored according to their relevance (see above).

Publications 

This section provides publications associated with the disease, currently obtained by searching all of the publications in the GeneCards database. For each publication, the title and a link to the PubMed article are supplied.

The articles are ranked, first according to the number of sources that associate the article with the disease-related genes, then by date of publication, and then according to the individual source scores for article/gene relationships.

Genes

This section provides the list of affiliated genes found to be associated textually with the key disease, using the GeneCards search mechanism. The table shows gene symbols, descriptions, relevance scores, and the GeneCards section in which the disease is associated (i.e., the context of the search hit). The relevance score is computed by the GeneCards search engine and represents the relevance of each gene to the disease. This relevance score takes into account the number of hits, and the importance of the fields in which they were found. For exact details about the computation of this score see: http://genecards.org/index.php?path=/HTML/page/searchHelp#relevance.

Expression 

This section provides normal tissue expression profiles for genes affiliated with the disease, via experimental results from a subset of the sources listed below.

The expression plots display the tissue-specific gene expression levels typifying the disease. The y-axis represents the genes ranked by expression level; the x-axis represents the tissues. Each column shows up to 100 of the most highly expressed disease-associated genes, ranked by their expression level. The size of the bar is determined by the number of genes in the tissue that have the top 20% expression level out of the disease-related genes. The corresponding gene-squares are colored according to their log2 expression levels. The color scale is common to all diseases in all tissues. The closer the color is to red, the higher the expression of the specific gene. Higher bars for specific tissues represent tissues in which the fraction of highly expressed genes of the disease gene-set is higher than in other tissues for that disease.

Pathways

This section provides pathways related to the disease, obtained by being contextually related to the key disease using the GeneDecks mechanism described above, applied to the set of affiliated genes. The pathways are extracted from a subset of the sources listed below. Entries are scored according to their relevance (see above).

Compounds 

This section provides relationships between MalaCards diseases and chemical compounds, obtained by being contextually related to the key disease using the GeneDecks mechanism described above, applied to the set of affiliated genes. Drugs and compounds are extracted from a subset of the sources listed below. Entries are scored according to their relevance (see above).

GO Terms

This section provides cellular component ontologies, biological process ontologies and molecular function ontologies enriched in the set of genes affiliated with the disease. The table displays the name of the relevant ontology, the GO ID (the identifier used by GO, linked to the GO entry), and the genes related both to the disease and to the specific ontology, as determined by the GeneDecks mechanism described above. The entries are scored according to their relevance (see above).

Sources

This section provides links to all of the following MalaCards sources, including those obtained via GeneCards:

  • BioGPS - A free extensible and customizable gene annotation portal, a complete resource for learning about gene and protein function.
  • CDC - The Centers for Disease Control and Prevention contains information about various diseases and health conditions.
  • Cell Signaling Technology - CST (Cell Signaling Technology) provides discovery tools for cell signaling research, including information about pathways and phosphorylation sites.
  • CenterWatch - CenterWatch is the leading trusted source for global clinical trial information.
  • ClinicalTrials - ClinicalTrials.gov is a registry and results database of federally and privately supported clinical trials conducted in the United States and around the world.
  • Disease Ontology - Disease Ontology provides a hierarchical open source ontology for the integration of biomedical data that is associated with human disease.
  • diseasecard - Diseasecard is an information retrieval tool for accessing and integrating genetic and medical information for health applications.
  • DISEASES - The University of Copenhagen DISEASES database provides disease-gene associations mined from literature.
  • DrugBank - The DrugBank database is a unique bioinformatics and cheminformatics resource that combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information.
  • EMD Millipore - EMD Millipore offers a broad range of Life Science tools, technologies, and services, and creates personalized solutions to industry challenges to assure scientific success.
  • FMA - The Foundational Model of Anatomy Ontology (FMA) is an evolving computer-based knowledge source for biomedical informatics; it is concerned with the representation of classes or types and relationships necessary for the symbolic representation of the phenotypic structure of the human body.
  • Gene Ontology - The Gene Ontology is a dynamic controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing.
  • GeneCards - GeneCards provides gene-centric information, automatically mined and integrated from a myriad of data sources, resulting in a web-based card for each of the tens of thousands of human gene entries.
  • GeneDecks - The GeneDecks project offers views and analyses about gene sets.
  • GeneReviews - GeneReviews are expert-authored, peer-reviewed, current disease descriptions that apply genetic testing to the diagnosis, management, and genetic counseling of patients and families with specific inherited conditions.
  • GeneTests - GeneTests is a clinical information resource relating genetic testing to the diagnosis, management, and genetic counseling of individuals and families with specific inherited disorders.
  • Genetics Home Reference - Genetics Home Reference provides consumer-friendly information about the effects of genetic variations on human health.
  • HMDB - HMDB is an electronic database containing detailed information about small molecule metabolites found in the human body.
  • ICD9CM - The International Classification of Diseases, 9th Revision, Clinical Modification. ICD-9-CM is the official system of assigning codes to diagnoses and procedures associated with hospital utilization in the United States.
  • KEGG - Kyoto Encyclopedia of Genes and Genomes (KEGG) provides pathway information.
  • LifeMap Discovery™ - LifeMap Discovery™ is a state-of-the-art platform for embryonic development and stem cell biology research.
  • MalaCards - MalaCards is an integrated database of human maladies and their annotations, modeled on the architecture and richness of the popular GeneCards database of human genes.
  • MedlinePlus - MedlinePlus contains health information from the National Library of Medicine.
  • MeSH - MeSH is the National Library of Medicine's controlled vocabulary thesaurus. It consists of sets of terms naming descriptors in a hierarchical structure that permits searching at various levels of specificity.
  • MGI - MGI (Mouse Genome Informatics, formerly MGD) provides a comprehensive source of information on the experimental genetics of the laboratory mouse; it includes information on mouse markers, mammalian homologies, probes and clones. GeneCards presents links to mammalian homology pages, the name of the mouse gene, its location (in centiMorgan), phenotypic alleles, and links to the entries for the mouse gene.
  • NCBI Bookshelf - NCBI Bookshelf contains a collection of biomedical textbooks.
  • NCIt - NCI Thesaurus (NCIt) provides reference terminology for many NCI and other systems. It covers vocabulary for clinical care, translational and basic research, and public information and administrative activities.
  • NDF-RT - NDF-RT organizes the drug list into a formal representation. NDF-RT is used for modeling drug characteristics including ingredients, chemical structure, dose form, physiologic effect, mechanism of action, pharmacokinetics, and related diseases.
  • NIH Clinical Center - The center for clinical research of the NIH includes general and patient information and organization resources.
  • NIH Rare Diseases - The Office of Rare Diseases Research (ORDR) at the National Institutes of Health (NIH) coordinates research and information on rare diseases.
  • NINDS - The National Institute of Neurological Disorders and Stroke (NINDS) conducts and supports research on brain and nervous system disorders.
  • Novoseek - Novoseek extracted knowledge from biological databases and text repositories, enabling users to uncover the knowledge hidden within these data sources. The relevance scores of elements related to genes (chemical substances and diseases) are based on the analysis of co-occurrences of two elements in Medline documents. The observed number of documents where both elements appear together and the number of documents where both appear independently are compared to an expected value based on a hypergeometric distribution. The Novoseek project is no longer accessible on the web, but is available upon request. MalaCards Novoseek data is based on GeneCards data from Novoseek from 2011.
  • OMIM - OMIM (Online Mendelian Inheritance in Man) is a catalog of human genes and genetic disorders with a lot of information about many different aspects (medical and genetic). GeneCards presents a list of diseases listed as allelic variants in the respective entry for the gene, synonyms for the gene, and a link to the OMIM database entry.
  • PharmGKB - PharmGKB is an integrated resource about how variation in human genes leads to variation in our response to drugs.
  • PubMed - PubMed comprises more than 22 million citations for biomedical literature from MEDLINE, life science journals, and online books.
  • QIAGEN - QIAGEN is the leading provider of sample and assay technologies.
  • R&D Systems - A company providing antibodies, assays, kits and additional products and services.
  • Reactome - Reactome provides a curated knowledgebase of biological pathways in humans.
  • SABiosciences - SABiosciences, a Qiagen biotechnology company, develops and markets a broad range of innovative and cost-effective research tools.
  • SNOMED-CT - SNOMED Clinical Terms (SNOMED CT) is the most comprehensive, multilingual clinical healthcare terminology in the world.
  • Thomson Reuters - GeneGo provides data mining and analysis solutions in systems biology.
  • Tocris Bioscience - Tocris Bioscience is a leading supplier of high performance life science reagents, peptides and antibodies, with customers in virtually all of the world's major pharmaceutical companies, universities and research institutes.
  • UMLS - The UMLS integrates and distributes key terminology, classification and coding standards, and associated resources to promote creation of more effective and interoperable biomedical information systems and services, including electronic health records.
  • Wikipedia - Wikipedia is a free encyclopedia built collaboratively.

 
