Marie Curie Chair of Excellence

Constraint Solving and Language processing for Bioinformatics - a three-way interdisciplinary project


By:

Veronica Dahl


Scientific and Social Motivation of this project:

The unprecedented growth in the volume of biological data over the past few years has created formidable challenges for the fields of computational molecular biology and biological knowledge retrieval. Previous information processing methods cannot keep up with this "information tsunami", which washes over heterogeneous landscapes, most notably human language text (articles, books, web sites, etc.) and genetic code text in nucleic acid language, such as DNA sequences.

Meaningful, useful, and in particular, timely processing of these sources needs to conjure the full power of all we have learnt so far in the fields of natural language processing, text mining and logical inference. This project develops theories, tools and applications in the intersection of these three fields, in the expectation that the present formidable challenges may be turned into formidable opportunities for scientific breakthrough, by synergistically exploiting the state-of-the-art that each of these fields has arrived at independently.

Project Description:

The many languages of molecular biology include those "sentences" of nucleic acid formed of the "words" A, C, T and G (shorthand for, respectively, the compounds Adenine, Cytosine, Thymine and Guanine). Those sentences contain patterns which can be discovered by analysing them with methods developed in computational linguistics. Some of these patterns also occur in natural languages. Some are relatively simple yet not necessarily easy to parse, e.g. palindromes: sequences that read the same from left to right as from right to left, such as the Spanish sentence (modulo blank spaces) "Dabale arroz a la zorra el abad", or the nucleic acid sequence A C C T G G T C C A. Their length can vary, and their position within a string is unpredictable.
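The palindrome examples above can be checked with a few lines of code. The following is only a minimal Python sketch, not the logic-based method used in this project: the function names normalize, is_palindrome and find_palindromes are hypothetical, and the search simply expands around every possible centre, so that neither the length nor the position of a palindrome needs to be known in advance.

```python
def normalize(text):
    """Keep letters only, uppercased, so blanks and case are ignored."""
    return "".join(ch.upper() for ch in text if ch.isalpha())


def is_palindrome(text):
    """True if the text reads the same in both directions, modulo blanks."""
    s = normalize(text)
    return s == s[::-1]


def find_palindromes(text, min_len=4):
    """Return (start, substring) for each maximal palindrome of at least
    min_len letters; positions refer to the normalized string."""
    s = normalize(text)
    n = len(s)
    found = []
    # 2n - 1 centres: each letter (odd length) and each gap (even length).
    for center in range(2 * n - 1):
        left = center // 2
        right = left + center % 2
        while left >= 0 and right < n and s[left] == s[right]:
            left -= 1
            right += 1
        start, end = left + 1, right
        if end - start >= min_len:
            found.append((start, s[start:end]))
    return found


if __name__ == "__main__":
    print(is_palindrome("Dabale arroz a la zorra el abad"))
    print(find_palindromes("A C C T G G T C C A"))
    # The same pattern embedded at an unknown position in a longer string.
    print(find_palindromes("TTACCTGGTCCAGG"))
```

This sketch follows the literal definition given above (the same letters in both directions); it does not model reverse-complement pairing between strands.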

Given these features, the objectives of our research are threefold:

- Developing methods of analysis which are free from extra machinery needed by previous methods, which typically need to resort to either preliminary alignment or statistical processing.

- Adapting text mining AI methodologies for computational biology text mining. We use modern methods such as constraint reasoning and logic programming, moulding them to the needs of our applications.

- Studying cross-fertilizations between the three fields of logic programming, computational molecular biology and natural language processing.

The possible uses of our results include medical applications as well as computational linguistics breakthroughs. For instance, our research on finding, from a given RNA secondary structure, a sequence which folds onto that structure can build on the achievements made in the field of in vitro genetics (done outside the living organism). Nowadays, scientists are capable of replicating any RNA in a test tube, which could eventually help in finding new paths for drug design or have industrial use. Likewise, our research on models for mining medical databases can result in early detection of conditions such as lung cancer.
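To make the RNA problem mentioned above more concrete, here is a minimal Python sketch of a necessary (but far from sufficient) condition for a sequence to fold onto a target secondary structure: every pair of positions that the structure joins must hold bases that can pair. It is an illustration only, not the method developed in this project. The dot-bracket notation for structures, the inclusion of G-U wobble pairs, and the names PAIRS, paired_positions and is_compatible are all assumptions made for the example.

```python
# Base pairs assumed allowed: Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def paired_positions(structure):
    """Return the sorted list of (i, j) index pairs joined in a dot-bracket
    structure, e.g. "((..))" gives [(0, 5), (1, 4)]."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure: extra ')'")
            pairs.append((stack.pop(), i))
        elif ch != ".":
            raise ValueError(f"unexpected symbol {ch!r}")
    if stack:
        raise ValueError("unbalanced structure: extra '('")
    return sorted(pairs)


def is_compatible(sequence, structure):
    """True if every paired position holds two bases that can pair.
    This does not show that the sequence actually folds this way."""
    if len(sequence) != len(structure):
        return False
    return all((sequence[i], sequence[j]) in PAIRS
               for i, j in paired_positions(structure))


if __name__ == "__main__":
    print(paired_positions("((..))"))
    print(is_compatible("GCAAGC", "((..))"))
    print(is_compatible("GCAAGA", "((..))"))
```

A real search for a sequence that folds onto the structure would need an energy model or a constraint solver on top of this filter.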

Methodology:

In this project we use recent and powerful logic-based methodologies, in particular Constraint Reasoning in its CHR (Constraint Handling Rules) and other incarnations, integrated with abduction and assumptions. All these have been integrated in HYPROLOG (and used in collaborations with France and Denmark; see http://control.ruc.dk), which provides hypothetical reasoning with novel flexibility in the interaction between the different paradigms, including all additional built-in predicates and constraint solvers that may be available through CHR.

Results so far:

Our most important scientific achievement within this project was the completion and publication of results which revolutionized the process for finding signature oligos from DNA sequences: our methods reduced the processing time needed from 6 months to 20 minutes [1]. We have generalized these results into a novel, high-level methodology for mining both linguistic and biological texts [2], and developed a DNA-inspired model of computational linguistics [3]. We have explored modern logical-inference-based methods for implementing Property Grammar, a current and powerful tool for robust parsing in the face of incomplete or erroneous input, and extended Property Grammar itself to include semantic inference capabilities [4,5]. We have also developed methods to recognize named entities in medical text, with application to medical texts in Chinese [6], and contributed as well to methodologies for teaching AI to students in the humanities [7]. We are now synthesizing many of these results into a very powerful computational and linguistic tool, abductive logic grammar, which combines parsing with the ability to construct meaning representations through abduction [9,10]. Abduction is the unsound but useful inference rule which Sherlock Holmes wrongly called "deduction", and which proceeds from observed facts to hypotheses that explain those facts.
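The idea of abduction described above can be illustrated with a toy Python sketch. This is not HYPROLOG nor the abductive logic grammars of [9,10]; the rule table RULES and the function abduce are hypothetical, invented for the example. Each rule reads "if this hypothesis holds, these facts would be observed", and abduction runs the rules backwards, from observations to candidate hypotheses.

```python
# Hypothetical toy knowledge base: hypothesis -> facts it would explain.
RULES = {
    "rain": {"wet_grass", "wet_street"},
    "sprinkler": {"wet_grass"},
}


def abduce(observations, rules=RULES):
    """Return, in alphabetical order, every hypothesis whose consequences
    cover all the observed facts."""
    observed = set(observations)
    return sorted(h for h, effects in rules.items() if observed <= effects)


if __name__ == "__main__":
    # Two competing explanations: the inference is useful but unsound.
    print(abduce(["wet_grass"]))
    # A further observation narrows the candidates down.
    print(abduce(["wet_grass", "wet_street"]))
```

The first call returns more than one hypothesis, which is exactly why abduction is unsound: the observed facts do not determine which explanation is the true one.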

KEYWORDS

Non-classical inference, computational logic, computational linguistics, computational molecular biology, DNA sequences, mining linguistic and biological texts, abduction, assumptions, genetic code.

DESCRIPTION OF THE SCIENTIFIC AND TECHNOLOGICAL HIGHLIGHTS AND MILESTONES

* Interdisciplinary work of many years for discovering DNA bar-codes was completed and published in the most prestigious journal in the field ([1]).

* New methods for mining texts either from natural language or from molecular biology were proposed, investigated and published, together with URV colleagues ([2]).

* Cross-fertilization between molecular biology and computational linguistics resulted in a biologically inspired model for the latter, which was peer-reviewed as well and accepted for publication by Springer-Verlag, in its most prestigious series: Lecture Notes in Computer Science ([3]). With this work, I also contributed to the training of co-author E. Maharshak.

* Parsing methodologies for Property Grammars were developed, as promised in the original work programme, and accepted for publication in the same prestigious series ([4]), as a consolidated and more accessible version of a preliminary paper presented previously at a workshop ([5]).

* Work on biomedical named entity recognition was completed and published, with colleague Fred Popowich and student Baohua Gu ([6]).

* Through supervision, I have contributed to a PhD thesis ([8]) and am contributing to a Master's thesis currently in preparation (by Amin Sharifi).

* Methods for teaching AI to students in the humanities have been developed and written up ([7]).

TEACHING and TRAINING ACTIVITIES:

- I have developed a novel course that makes high-level AI tools accessible to researchers in the humanities, which I have taught at Universidad de Rovira i Virgili. Together with colleagues Gemma Bel Enguix and Maria Dolores Jimenez Lopez, I have distilled the main results of this unique experience (it is the first time that none of my students had any prior exposure to computing) into an article of which I am the main author, and which was submitted to AI-CIT'09. Unfortunately, this entire session of the conference was cancelled (so the paper was neither accepted nor rejected).

- I have endeavoured to inspire early-stage researchers through several forms of supervision and teaching, and through feedback on their written work, both at Universidad de Tarragona and (in the framework of the cooperations with Portugal and Canada) at Universidade Nova de Lisboa and at Simon Fraser University: most notably, researcher Erez Maharshak, international students Cristina Tirnauca, Adrian Dediu, Arati Panou, Clemens Dubslaff, Joao Moura, Anh The Anh, Rukhaia Mishiko, Caroline Cardepus, Alma Barranco-Mendoza (Post-Doc), Evgeny Skvorstsov, Nima Kaviani, Baohua Gu and Amin Sharifi. I have also participated in thesis defenses and similar activities. Under my supervision, Baohua Gu completed his PhD thesis this summer, and Amin Sharifi is expected to complete his Master's thesis this summer.

DESCRIPTION OF TRAINING AND NETWORKING ACTIVITIES

a) International Project Activity: As a fundamental part of my mandate as Marie Curie Chair of Excellence, I have devoted much effort to establish several of the collaborations promised in my research plan. In particular:

* Acciones Integradas Luso-Espagnolas: In collaboration with Dr. Pedro Barahona from Universidade Nova de Lisboa, I have put together a two-year research project within the Acciones Integradas Luso-Espagnolas, whose Portuguese half was submitted in April, and whose Spanish counterpart was submitted in early September. Spanish participants are Gemma Bel Enguix and Maria Dolores Jimenez Lopez, from Universidad de Rovira i Virgili. Lisbon participants include Ludwig Krippahl, Olivier Perriquet and Ruben Frederico Duarte Viegas. The project's theme is "Constraint- and Hypothetical-based Reasoning for Bioinformatics". This initiative was successful: our project was recently approved, and we are currently setting up the organizational details.

* Control Project: In collaboration with Laboratoire de Langage et Parole at Universite d'Aix-en-Provence and with Roskilde University in Denmark, I completed in February 2008 all duties related to the Control Project, funded by the Danish National Science Foundation, which lasted four years. We also prepared a request for continued funding of the project: Henning Christiansen visited us at Universidad de Tarragona in late September, and I visited Philippe Blache in early September in order to put together the request for extension. Unfortunately, this initiative was not successful. However, we shall try to continue the project even if it does not formally proceed through the Danish Council.

* Summer School for Girls, joint project with Association for Logic Programming (contact person: Enrico Pontelli), in collaboration with NMSU and CRA.

* Preliminary discussions with the Minister of Science and Technology in Argentina, Dr. Lino Baranao, and with Department Chair Dr. Hugo Scolnik, in view of setting up a possible future collaboration.

b) Scientific Service to the International Community

* As member of the Killam Selection Committee, I have assisted the Canada Council for the Arts in identifying the winners for the prestigious Killam Prizes and Scholarships.

* As member of IMDEA's Software Scientific Advisory Board, I have participated in all its organizational and decision-making aspects and attended its plenary meetings in Madrid. I am, in addition, its link to Latin America, with the mission of establishing needed contacts with both researchers and students. In the long run, I hope to close this circle into some formal three-way collaboration with SFU; for now I am exploring the possible means, as described under International Project Activity.

* As Past President of the Association for Logic Programming (ALP), I participate in all decisions as well as providing the "memory" of the system, for the new President's benefit (this is an active, current position, despite the name).

* As Director of FLoC Inc. (Federated Logic Conferences), I likewise participate in all decisions and organizational aspects.

* As Science for Peace Board Member, likewise.

c) Promoting the incorporation of more women and underrepresented groups into Computing Sciences:

- As mentioned under International Project Activity above, I have participated in developing a proposal for the 2008 International Logic and Constraint Programming Summer School at NMSU, targeting mostly women and traditionally under-represented minorities. Thanks to the joint support of the Computing Research Association Committee on the Status of Women in Computing Research and the Coalition to Diversify Computing, we have *several full scholarships* (i.e., travel and lodging) for minority and women applicants. I am actively preparing a similar initiative for next year, in collaboration with URV and Simon Fraser, which we hope to extend to other universities as well.

d) Conference Committees, refereeing activities, sessions chaired

- Program Committee Member, ForLing 2008, Tarragona.

- Reviewer, EISTA 2008.

- Reviewer, CHR'08.

- Session Chair, ForLing 2008, Tarragona.

e) Editorial Work

* Associate Editor: Computational Intelligence Journal

* Editorial Board Member: International Journal of Expert Systems

* Invited Editor for a 2008 submission to TPLP in a subfield for which its board did not have a specialist.

Participation in conferences and other scientific events

Publications

Peer-Reviewed Articles in Journals

1. Zahariev, M., Dahl, V., Chen, W. and Levesque, A. (2009) Efficient Algorithms for the Discovery of DNA Oligonucleotide Barcodes from Sequence. To appear in: International Journal of Molecular Ecology Resources.

Peer-Reviewed Articles in Conference Proceedings

2. Bel Enguix, G., Jimenez-Lopez, M.D., and Dahl, V. (in press) Mining Linguistics and Molecular Biology Texts through Specialized Concept Formation. Poster, NLPCS'09.

5. Dahl, V. and Gu, B. (2008) On Cognitive Based Property Grammars. In Proc. CSLP 2009, Hamburg, Germany.

6. Gu, B., Popowich, F. and Dahl, V. (2008) Recognizing Biomedical Named Entities in Chinese Research Abstracts. In Proceedings of the 21st Canadian Conference on Artificial Intelligence (AI-2008). Windsor, Ontario, May 28-30, 2008.

Peer-Reviewed Books and Monographs

3. Dahl, V. and Maharshak, E. (in press) DNA Replication as a Model for Computational Linguistics. LNCS, Springer-Verlag.

4. Dahl, V., Gu, B. and Maharshak, E. (in press) A Hyprolog methodology for Property Grammars. LNCS, Springer-Verlag (this is a more accessible, monograph version of workshop paper [5]).

Submitted

7. Bel Enguix, G., Jimenez-Lopez, M.D., and Dahl, V. (2009) Teaching Logic Programming Tools for Interdisciplinary Computing. Submitted to AI-CIT'09. Unfortunately this session was cancelled, so the paper was neither accepted nor rejected.

9. Christiansen, H. and Dahl, V. Abductive Logic Grammars. Submitted during this period to WoLLIC 2009.

Contributions through supervision

8. Gu, B. (2008) Recognizing Named Entities in Biomedical Text. PhD thesis, SFU.

Manuscripts in Preparation

10. Christiansen, H. and Dahl, V. Abductive Logic Grammars. Enhanced version of [9], to be published in Springer-Verlag's monograph series: LNAI 5514-0170.

Personal Statement on how all this came to be

I have always been fascinated by languages of any kind, as communication is a paramount activity among humans, and even "lesser" biological entities have their own sophisticated languages, as exemplified by DNA or RNA strings. But I'd never really thought seriously about enlisting my expertise for molecular biology ends, until hearing Ross Overbeek state, in his invited talk at the Logic Programming Conference in 1992, that my book with Harvey Abramson (Logic Grammars) had been extensively used around the world for finding the human genome.

I was gratefully surprised, but even though this piqued my interest, the possibility of joining in was no more than a remote dream: as a busy university professor with young children to raise and with artistic inclinations that needed to be nourished as well, I deemed the time and effort necessary to develop yet another cross-discipline speciality as out of my reach.

Then one of my music (sic!) partners, plant pathologist Andre Levesque, casually mentioned a problem he was working on, which I fancied I could help solve. Shortly after, Manuel Zahariev, and more recently, W. Chen, joined us in this research. It took us several years, but the results, completed and published in 2009, were spectacular: our work reduced the processing time of a process routinely used at Agriculture and Agri-Food Canada from six months to only fifteen minutes.

Halfway into this research, a genetic heart disease running in my family killed my stepson and threw me into a fever of studying. I audited biology courses, read all I could, and explored collaborations with medical doctors and biologists, until I had to humbly conclude that the hope of combining my expertise in AI with molecular biology in view of successfully treating genetic cardiomyopathy was simply a delirious dream with exceedingly slim chances of success. By then I had published some by-products of my studying frenzy, which placed me unwittingly in the intersection of three fields: artificial intelligence, computational linguistics and computational molecular biology. Then I won my Marie Curie Chair of Excellence Award, a consolation prize, for all this. I take it with the same humbleness life has taught me to take all challenges, knowing that all roads lead to marvellous places one cannot imagine at the start, and hoping to be as helpful in this new, three-way interdisciplinary field as I possibly can.

Acknowledgements: I've been fortunate to enlist into this quest superb students and colleagues (respectively: Maryam Bavarian, Manuel Zahariev, Alma Barranco-Mendoza, Kimberly Voll, Baohua Gu, Jiang Ye, A. Persaoud; and Andre Levesque and Fred Popowich where biological text mining is concerned, and Henning Christiansen for constraint-based and hypothetical reasoning). Much of our success rests solidly on previous work with other wonderful colleagues (Silvia Clerici, Susana Lilliecreutz, Marcos Elinger, Alfredo Hurtado, Gabriel Bes, Roland Sambuc, Michael McCord, Harvey Abramson, Paul Tarau, Jamie Andrews, Koen De Boschere, Luis Moniz Pereira, Michel Boyer, Philippe Blache, Fred Popowich, JiaWei Han, Patrick Saint-Dizier, J.G. Pereira Lopes, Y-N. Huang, Michael Rochemont) and students or Post-Docs (Eli Hagen, Brigitte Dorner, Andrew Fall, Renwei Li, Jo Calder, Marius Scurtescu, Stephen Rochefort, Stephen Tse, T. Yeh, Pablo Accuosto, Greg Sidebottom, Jorg Ueberla, Pierre Massicotte, Charles Brown, T. Pattabhiraman, Dulce Aguilar Solis, Joao Balsa, Diane Massam). My deep gratitude to all.