We have worked on a number of different projects where a client needed to parse scientific literature and identify occurrences of molecules or proteins.
As an example, the molecule on the right is Aspirin, a name that is still a trademark of Bayer in some countries. In a paper, however, it could appear as acetylsalicylic acid, 2-acetoxybenzenecarboxylic acid, C9H8O4, or under one of a number of identifiers such as DB00945. To complicate matters, an identifier may also refer to other molecules, or to only one version of a molecule.
Another example we have encountered often in clinical papers is the gene ERBB2, which is important in certain types of breast cancer. ERBB2 is also called Erb-B2 Receptor Tyrosine Kinase, HER2, HER-2 and many other names. These names often also refer to the protein expressed by the gene. Many names are similar to common English words, and are not always capitalised in text.
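Because of these complications (many synonyms for one entity, names shared between a gene and its protein, and names that look like ordinary English words), identifying the names of proteins, genes and molecules in scientific literature is fraught with difficulty.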
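To make the synonym problem concrete, here is a minimal sketch of the most naive approach: a lookup table that maps every known surface form to one canonical entity. The table contents come from the examples above; the names `SYNONYMS`, `build_matcher` and `find_mentions` are hypothetical and chosen only for illustration, and this is not meant as a description of any production system.

```python
import re

# Hypothetical synonym table: canonical entity -> surface forms mentioned above.
SYNONYMS = {
    "aspirin": [
        "Aspirin",
        "acetylsalicylic acid",
        "2-acetoxybenzenecarboxylic acid",
        "C9H8O4",
        "DB00945",
    ],
    "ERBB2": [
        "ERBB2",
        "Erb-B2 Receptor Tyrosine Kinase",
        "HER2",
        "HER-2",
    ],
}


def build_matcher(synonyms):
    """Return a compiled regex and a lowercase surface-form -> canonical lookup."""
    lookup = {}
    for canonical, names in synonyms.items():
        for name in names:
            lookup[name.lower()] = canonical
    # Try longer names first so a long name is preferred over a shorter one
    # that happens to start at the same position.
    alternatives = sorted(lookup, key=len, reverse=True)
    body = "|".join(re.escape(name) for name in alternatives)
    # The lookarounds stop a name from matching inside a longer alphanumeric token.
    pattern = re.compile(
        r"(?<![A-Za-z0-9])(" + body + r")(?![A-Za-z0-9])",
        re.IGNORECASE,  # names are not always capitalised in text
    )
    return pattern, lookup


def find_mentions(text, pattern, lookup):
    """Return (surface form, canonical entity) pairs in order of appearance."""
    return [(m.group(1), lookup[m.group(1).lower()]) for m in pattern.finditer(text)]


pattern, lookup = build_matcher(SYNONYMS)
mentions = find_mentions(
    "Patients with HER-2 positive tumours received acetylsalicylic acid.",
    pattern,
    lookup,
)
```

Case-insensitive matching is what lets uncapitalised mentions through, but it is also exactly what produces false positives once the table contains names that resemble common English words. A table like this also cannot tell whether a mention refers to the gene or to the protein it expresses, which is why a plain dictionary lookup is rarely sufficient on its own.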