MalaCards Search Guide


This page provides information about MalaCards search mechanisms.

 

MalaCards Search features 

MalaCards searches are based on Lucene/Solr and support the following features:

  • Stemming is applied to all searches, so related word forms are found in addition to exact matches.
  • A search for multiple words (e.g. Alzheimer Disease) behaves as an AND across the entire document; i.e. each of the words must appear in at least one section of a matched MalaCard. To search for an exact phrase, enclose the phrase in double quotes.
  • Use parentheses in complex Boolean searches to indicate precedence; otherwise AND operations take precedence over OR operations (see the search examples). Boolean operators must be capitalized: a search using "and" will not produce the same results as one using "AND".
  • The disease name and aliases fields are boosted, so MalaCards whose name or aliases match the search terms score higher than MalaCards that contain the terms only in other fields.

 

What Do I Get? 

Search results are presented as a table with the following columns: 1. Serial number of the result 2. Family association (P = parent, c = child) 3. MalaCards ID 4. Disease name 5. MIFTS 6. Relevance score.

 

Simple Search

Enter an expression into the search field on the MalaCards homepage and click 'search'.

Notes:

  • The MalaCards search is case insensitive.
  • When search terms are enclosed in double quotes, the search is for an exact phrase. Phrase matching ignores trivial words such as "a", "and" and "then"; for example, searching "heart brain" would also retrieve the string "heart and a brain". You can also append a tilde (~) and a number to the phrase to set the maximum distance between its terms (excluding trivial words); e.g. "heart brain"~50 searches for heart and brain within at most 50 words of each other.
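
The phrase and distance behaviour described in the note above can be illustrated with a minimal Python sketch. It is not the Lucene implementation: the stop-word list is an assumption, the function names are hypothetical, and only in-order matches are considered.

```python
import re

# Assumed list of "trivial" words; the actual list used by the search engine may differ.
STOP_WORDS = {"a", "an", "and", "the", "then", "of"}

def content_tokens(text):
    """Lower-case the text, split it into words, and drop trivial words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]

def phrase_match(phrase, text, slop=0):
    """Return True if the phrase terms occur in order with at most `slop` extra words between them."""
    terms = content_tokens(phrase)
    tokens = content_tokens(text)
    if not terms:
        return False
    for start, token in enumerate(tokens):
        if token != terms[0]:
            continue
        pos, extra, found = start, 0, True
        for term in terms[1:]:
            try:
                nxt = tokens.index(term, pos + 1)  # nearest later occurrence
            except ValueError:
                found = False
                break
            extra += nxt - pos - 1  # non-trivial words skipped between the two terms
            pos = nxt
        if found and extra <= slop:
            return True
    return False

print(phrase_match("heart brain", "heart and a brain"))
print(phrase_match("heart brain", "heart disease affects the brain"))
print(phrase_match("heart brain", "heart disease affects the brain", slop=50))
```

In this sketch, "heart and a brain" matches with no distance allowance because the trivial words are removed before positions are compared, whereas "heart disease affects the brain" matches only when a distance is given.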

 

Wild card Searches 

The * character serves as a wildcard that matches any string of characters (including no characters), and the ? character matches exactly one character. Wild cards can be used anywhere in the search string except as the first character; searches with an initial wild card, for example *gammaglob*, are not supported.

For example:

Searching for acid* will find maladies whose cards contain strings such as acid, acidic, acidosis, and aciduria. Searching for acid? will yield those that contain strings such as ACID2 and acidi. The term ac*d finds acid, ACY1D, and ACD, while ac?d finds acid and ACAD-8, among others.

Note that this is different from stemming, which matches strings that are considered to be related to the specified keyword. Stemming allows the search for acid (no wild card) to find acid and acidic.
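
The wildcard rules can be sketched in Python by translating a search pattern into a regular expression. This is only an illustration of the matching semantics described above, not the search engine's own code; the function names and the sample term list are hypothetical.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a pattern containing * and ? into a case-insensitive regular expression."""
    if pattern[:1] in ("*", "?"):
        raise ValueError("wild cards are not supported as the first character")
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # any run of characters, including none
        elif ch == "?":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def matches(pattern, term):
    """Return True if the whole term matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(term) is not None

# Sample indexed terms, taken from the examples in this section.
terms = ["acid", "acidic", "acidosis", "aciduria", "ACID2", "acidi", "ACY1D", "ACD", "ACAD"]
for pattern in ("acid*", "acid?", "ac*d", "ac?d"):
    print(pattern, [t for t in terms if matches(pattern, t)])
```

Note that acid? does not match the bare term acid, because ? requires exactly one additional character, while ac*d matches ACD because * may match no characters at all.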

 

Search Examples: 

* In the search (neurodegenerative OR senile) AND Alzheimer the use of parentheses is important. Without parentheses the AND takes precedence, so the search is interpreted as neurodegenerative OR (senile AND Alzheimer).
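
The effect of the parentheses can be illustrated with Python sets, where each set holds the cards that contain a given word. The card identifiers below are invented for the illustration; Python's & and | operators happen to follow the same precedence rule as AND and OR (the intersection is evaluated first).

```python
# Hypothetical cards and the words they contain (illustrative data only).
neurodegenerative = {"CARD_A", "CARD_B"}
senile = {"CARD_C", "CARD_D"}
alzheimer = {"CARD_B", "CARD_C"}

# (neurodegenerative OR senile) AND Alzheimer
with_parens = (neurodegenerative | senile) & alzheimer

# neurodegenerative OR senile AND Alzheimer: the AND is evaluated first
without_parens = neurodegenerative | senile & alzheimer

print(sorted(with_parens))
print(sorted(without_parens))
```

Without parentheses, CARD_A is returned even though it does not mention Alzheimer, because it satisfies the neurodegenerative branch of the OR on its own.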

 

Search Type | Search String | Search Description
Simple Search | breast | Exact word match (case insensitive).
Wild Card Search | live* | Any object that begins with the string "live" or some derivative of "live" determined by the search engine's stemming algorithm.
Multiword Search | obesity diabetes | Search behaves as if an AND was used (see the AND search below).
Multiword Search | obesity AND diabetes | All strings or variants of strings must exist in the MalaCard.
Multiword Search | obesity OR diabetes | At least one of the strings or its variants must exist in the MalaCard (notice the difference from the AND search).
Multiword Search | (neurodegenerative OR senile) AND Alzheimer | Finds all instances of either neurodegenerative AND Alzheimer or senile AND Alzheimer.
Multiword Search | "macular degeneration" | Exact phrase search. All words in the order that they are entered must be in the MalaCard.

Advanced Search

Advanced search enables you to browse MalaCards for more specific results. A broader variety of search options is offered in order to focus the search.

To use the Advanced search, click the Advanced link next to the search box on the MalaCards homepage, or next to the search box within each card. A new page with more search options appears.
There you can choose which section of the MalaCards you would like to search in (e.g. aliases, summaries, pathways).

Now type the search string in the search field. To add more terms to your search, or to search in more than one field of a MalaCard, click the + button next to the search box to add another row. For each added row, you can choose whether its term is combined with the previous one by AND or by OR, by changing the first select box in the new row from "and" to "or".

As in the simple search, you may enter multiple terms and explicit sub-queries in each search field.

Hit Context (Minicard) 

The search results are first displayed closed, showing the serial number, family, MCID, disease name, MIFTS score and relevance score. To open a minicard, click the plus sign to the left of the Family column in the appropriate row. All fields of the MalaCard in which your search term(s) were found will be displayed. The section names, which appear on the left, are linked to the corresponding sections in the MalaCard. All of the keywords entered in your search, including any variants found due to stemming, are highlighted in the minicards.

The minicard list is sorted by relevance (determined by the relevance scoring method).

Relevance Scores 

The search platform is Solr, which is built on Apache Lucene's text search API.

When a term is searched Lucene returns a set of scored hits.

A "hit" represents a document (in our case a MalaCard), whose fields (actual annotations) were previously indexed by Lucene.

The score is calculated by a Lucene-defined algorithm (see Lucene's Similarity class):

score(q,d) = coord(q,d) · queryNorm(q) · Σ_{t in q} [ tf(t in d) · idf(t)^2 · t.getBoost() · norm(t,d) ]

The factors in this formula are (see the Solr Relevancy FAQ):

  • tf stands for term frequency - the more times a search term appears in a document, the higher the score
  • idf stands for inverse document frequency - matches on rarer terms count more than matches on common terms
  • coord is the coordination factor - if there are multiple terms in a query, the more terms that match, the higher the score
  • lengthNorm (part of norm(t,d)) - matches on a smaller field score higher than matches on a larger field
  • index-time boost (also part of norm(t,d)) - if a boost was specified for a document at index time, scores for searches that match that document will be boosted
  • query clause boost (t.getBoost()) - a user may explicitly boost the contribution of one part of a query over another
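
As a rough illustration of how these factors interact, the following Python sketch computes a score for a toy corpus. It assumes the defaults of Lucene's classic TF-IDF similarity (tf as the square root of the term frequency, idf as 1 + ln(numDocs / (docFreq + 1)), and a length norm of 1 / sqrt(number of terms)); the configuration and boosts actually used by MalaCards are not specified here, and the function name and corpus are hypothetical.

```python
import math

def classic_score(query_terms, doc_tokens, corpus, boosts=None):
    """Toy TF-IDF score for one document, following the factors listed above."""
    boosts = boosts or {}
    num_docs = len(corpus)

    def idf(term):
        doc_freq = sum(1 for doc in corpus if term in doc)
        return 1.0 + math.log(num_docs / (doc_freq + 1))

    matched = [t for t in query_terms if t in doc_tokens]
    if not matched:
        return 0.0

    coord = len(matched) / len(query_terms)  # fraction of query terms found
    sum_sq = sum((idf(t) * boosts.get(t, 1.0)) ** 2 for t in query_terms)
    query_norm = 1.0 / math.sqrt(sum_sq)     # makes scores comparable across queries
    length_norm = 1.0 / math.sqrt(len(doc_tokens))  # shorter fields score higher

    total = 0.0
    for t in matched:
        tf = math.sqrt(doc_tokens.count(t))
        total += tf * idf(t) ** 2 * boosts.get(t, 1.0) * length_norm
    return coord * query_norm * total

# Hypothetical, pre-tokenized documents.
corpus = [
    ["alzheimer", "disease"],
    ["alzheimer", "disease", "early", "onset", "familial", "form"],
    ["heart", "disease"],
    ["breast", "cancer"],
]
query = ["alzheimer", "disease"]
scores = [classic_score(query, doc, corpus) for doc in corpus]
for doc, s in zip(corpus, scores):
    print(round(s, 3), " ".join(doc))
```

In this toy corpus the short document containing both query terms ranks first, the longer document with both terms ranks second (length norm), the document matching only one term ranks third (coord), and the document matching neither term scores zero.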

Each field can be "boosted", which means increasing the weight of matches in that field at search time.

In MalaCards, we "boost" the following fields:

  • Malady Name
  • Aliases and Descriptions

You can read more about Lucene's scoring mechanism here: Apache Lucene - Scoring

To present the scores with the best precision, we display each score as (base 2 log of the score) + 10.
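
The transformation of a raw score into the displayed score can be written as a short Python sketch. The function name is hypothetical, and raw scores are assumed to be positive.

```python
import math

def displayed_score(raw_score):
    """Convert a raw relevance score into the displayed value: log2(score) + 10."""
    if raw_score <= 0:
        raise ValueError("the logarithm is only defined for positive scores")
    return math.log2(raw_score) + 10

for raw in (0.25, 1.0, 8.0):
    print(raw, displayed_score(raw))
```

Under this transformation a raw score of 1 is displayed as 10, and each doubling of the raw score adds 1 to the displayed value.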