NLM in Focus

Quick Q&A with Text Mining Research Group

We ask. They tell.

From what motivates them to who inspires them to what the heck they’d be doing if they weren’t here, this team of 12 scientists shares a bit about themselves in today’s “Quick Q&A” feature.

Led by Dr. Zhiyong Lu at the National Center for Biotechnology Information (NCBI), this team does text-mining research and works to improve access to NCBI’s literature services such as PubMed and PubMed Central.

Read on to find out who lived where there was no heat, limited electricity and no interest in a college graduate who majored in math; who won the Abalone world championship in 2014; who might have been a chef if not for working here; and who takes special pride in knowing that PubMed is used by millions of users worldwide. Or jump directly to your favorite scientist:

Don Comeau  | Rezarta Dogan  |  Nicolas Fiorini  |  Alan Hsu  |  Won Gyu Kim  |  Sun Kim  |  Robert Leaman  |  Wanli Liu  |  Zhiyong Lu  |  Yifan Peng  |  Chih-Hsuan Wei  |  Lana Yeganova

Quick Q&A with Zhiyong Lu, Yifan Peng, and Lana Yeganova
Question | Zhiyong Lu, PhD | Yifan Peng, PhD | Lana Yeganova, PhD
What is the focus of your NLM research and why is it significant?
Zhiyong Lu: I direct the text mining research at NCBI/NLM, investigating and developing new computational methods for extracting information from free text in biomedicine, for example, scholarly publications and clinical notes.

I coordinate and lead the overall effort for improving biomedical literature search in PubMed. This drives accelerated discovery, which leads to better health.

Yifan Peng: My research focuses on improving the extraction of disease-chemical, disease-mutation-gene, and drug-drug relations. The research exemplifies biomedical text mining that uses the curated document-level annotations available in existing biomedical databases, which are largely overlooked in text-mining system development.

The need to identify relations among chemicals, diseases, mutations, and genes for new drugs in development, and to improve chemical safety, has led to a growing interest in automatic relation extraction systems that capture these relations from the rich and rapidly growing biomedical literature.

Lana Yeganova: The focus of my research is natural language processing and text mining. We supply a machine with large amounts of text and teach it how to comprehend that text.

It is fascinating to observe how computers understand natural language without actually understanding it; how mathematical models beautifully drive that understanding of natural text through frequencies and co-occurrences of words.

Most recently, my focus has been on query understanding, which involves understanding what the searcher wants and inferring the intent of the query.

What or who inspired you to pursue your career?
Zhiyong Lu: I chose this career because research is a fun thing to do! Having the opportunity to use my research in real-world applications is a plus.
Yifan Peng: A talk with a friend who received a PhD in computer science and then went into academia, and subsequently started his own company, inspired me to work in this field.
Lana Yeganova: I’ve had amazing teachers and mentors throughout college and into graduate school, as well as my professional life at NCBI. They inspired me directly and indirectly, through continuous challenges and scarce praise, by igniting curiosity and an unsatiated desire for knowledge, by sharing the beauty and the intuitiveness of science, by setting a personal example, and by offering their priceless advice and even friendship.
How did you get started in your career?
Zhiyong Lu: I majored in computer science as an undergrad, was introduced to machine learning and bioinformatics during my master’s program, and followed that with a PhD dissertation in biomedical natural language processing (on NLM’s very own GeneRIF data). As part of my PhD program (supported by NLM’s training grants), I received a graduate-level education in molecular biology.
Yifan Peng: I started with an undergraduate degree in computer science and a master’s degree in specialized natural language processing. I then pursued my PhD study and was co-mentored by two thesis advisors, one of whom specializes in biology and the other who specializes in computer linguistics. They kindled my interest in biomedical text mining.
Lana Yeganova: It was 1995. Armenia was suffering the aftermath of the collapse of the Soviet Union: six hours of electricity a day, no heat, limited water, and zero demand for a college graduate with a bachelor’s degree in math.

The only attractive option was to go back to school. American University had opened its doors in Armenia in 1990 and gave me, among many others, not only acceptance and a scholarship, but a ticket to a very exciting career.

Visiting US professors served as faculty. My future advisor, James Falk, PhD, from George Washington University, encouraged me to apply to graduate school. I came to the United States in 1997 for a PhD program in mathematical optimization. After graduating in 2001, I started at NCBI/NLM and was fortunate again to have W. John Wilbur, PhD, as my mentor and advisor. Fifteen years later, NCBI/NLM continues to be one of the most academically stimulating environments.

What really gets you jazzed about science and research?
Zhiyong Lu: Seeing our work used by millions of NCBI/NLM users worldwide every day is very rewarding.
Yifan Peng: I think this quote describes it best: “We are determined that our work will make the torch of biomedical knowledge burn ever brighter. That was true last week. It is true today. It will be true tomorrow.” (Francis S. Collins, MD, PhD, Director of NIH)
Lana Yeganova: Artificial Intelligence: The excitement of being able to teach a machine to become smart(er); the ability of an artificial brain to explore millions of records and offer you the ones of most interest; the potential to discover from the literature previously unknown knowledge and associations.
If you weren’t doing this work, what other profession might you have pursued?
Zhiyong Lu: Maybe medicine.
Yifan Peng: I have done internships at IBM and Google. Probably, I would end up being a computer engineer.
Lana Yeganova: In order: musician, pharmacist, criminal justice professional, biologist, dancer, art collector.
Tell us something surprising about yourself.
Zhiyong Lu: My name in native Chinese characters is so unique (one in a billion chance) that there would be no ambiguity issue with other same-name authors (Lu, Z) in PubMed.
Yifan Peng: I finished reading The Goldfinch (784 pages) in two years.
Lana Yeganova: I had to venture out to my friends for this. Here are a couple of responses: versatility in various social environments and the ability to not judge people.


Quick Q&A with Robert Leaman, Wanli Liu, and Sun Kim
Question | Robert Leaman, PhD | Wanli Liu, PhD | Sun Kim, PhD
What is the focus of your NLM research and why is it significant?
Robert Leaman: I research methods to extract information from text, such as PubMed abstracts, focusing on trainable computational models for locating and identifying biomedical terminology such as diseases, drugs, and genes.

My work answers the “What?” of “Who, What, When, Where, Why and How?” Using a trainable computational model (machine learning) means that it can be adapted to different applications as needed.

Wanli Liu: I work primarily on the name project for PubMed authors and NIH-funded principal investigators.

The project helps PubMed users search by name, which is the most frequent category of information in PubMed queries.

Because human names can be highly ambiguous, we apply advanced statistical machine learning methods to disambiguate similar author names. We have achieved and published state-of-the-art performance data and are working to make further improvements.

I am also involved in other information retrieval projects, for example, MeSH term work with popular deep-learning techniques.

Sun Kim: My research focus is on semantics.

What is the meaning of a word? What is the relationship between words, phrases, or sentences in biomedical literature?

These are fundamental questions in natural language processing because the journey to find the answers helps identify more relevant information in PubMed and PubMed Central (PMC) documents.

What or who inspired you to pursue your career?
Robert Leaman: I attended a computer programming camp in 4th grade and was quickly hooked.

I loved my undergraduate class in artificial intelligence. That was the first thing I explored when I returned for a graduate degree.

Wanli Liu: I was inspired by the overall working spirit of NCBI, especially the leadership of David J. Lipman, MD, David Landsman, PhD, and those helping with my projects, John Wilbur, MD, PhD, and Zhiyong Lu, PhD. Their enthusiasm and professionalism encouraged me to pursue research in this field.

Sun Kim: Friends, colleagues, and basically everything around me. When I entered college, computer science was not my first preference, but starting from there, it was natural to study machine learning and text mining.
How did you get started in your career?
Robert Leaman: My PhD advisor heard about a workshop (BioCreative) offering a shared task in identifying gene names. The task showed us that solving many information extraction problems depends on first being able to locate important terms. We decided not only to focus there but to release our systems open source so others could use our work directly.
Wanli Liu: Right before graduating from my PhD program, I heard of an open research position at NCBI. Since my PhD study is related to machine learning techniques, I can continue to work in this field and apply my related skills.
Sun Kim: I was always a science-loving guy, but I would say my first contact with Apple computers and basic programming in middle school opened my eyes at that time.
What really gets you jazzed about science and research?
Robert Leaman: I enjoy creating new methods that enable solutions to problems not addressed before.
Wanli Liu: Machine learning and natural language processing are developing quickly and with broad applications. Recent developments, such as deep learning, are gaining popularity. I am excited to explore these latest techniques.
Sun Kim: Because it is like a never-ending story. Something I did not know excites me.
If you weren’t doing this work, what other profession might you have pursued?
Robert Leaman: Before starting my PhD, I worked in industry as a software engineer automating the robotics used to build semiconductors.
Wanli Liu: I also trained as a computer architect. I would have designed computer chips with a focus on improving system performance.
Sun Kim: Maybe astronomy? I enjoyed watching the sky and making a small telescope when I was about 10 years old.
Tell us something surprising about yourself.
Robert Leaman: I also enjoy cooking. Kitchen experiments have a faster turnaround time and the results are often delicious.
Wanli Liu: While others can work out when listening to music, I can read English news while listening to Chinese talk shows.
Sun Kim: I should sleep a minimum of 8 hours every day.


Quick Q&A with Rezarta Dogan, Nicolas Fiorini, and Don Comeau
Question | Rezarta Dogan, PhD | Nicolas Fiorini, PhD | Don Comeau, PhD
What is the focus of your NLM research and why is it significant?
Rezarta Dogan: My passion for data analytics is apparent in my work for the log data research project, where I study user behaviors and system responses. I lead the study on biomedical abbreviations.

I am active with the BioC project, which facilitates data sharing and annotations, fostering interoperability between systems, tools, and research groups.

Other work involves developing algorithms for recognizing biomedical entities (such as diseases) and relations (such as genetic interactions) in unstructured text (PubMed).

Nicolas Fiorini: I focus on improving PubMed’s relevance search.

PubMed responds daily to millions of queries in biomedical literature. Given the current number of papers (more than 26 million), retrieving the most relevant ones for a given query is a challenging task. It involves text mining, natural language processing, machine learning, and algorithm optimization.

Don Comeau: I primarily provide keyword indexing for Bookshelf and PubMed Central. Many important concepts are better described by phrases than individual words. Formal ontologies are invaluable, but they naturally lag cutting-edge usage. We can identify new, meaningful phrases when they are first used.
What or who inspired you to pursue your career?
Rezarta Dogan: My passions have always included analytics, medicine, and books. Our tiny apartment in Albania was filled with books in every possible corner. Reading was encouraged and understanding was required.

I learned research by observing my father. He was a respected medical doctor, whose work made him renowned around the country. As a result, people would knock on our door at every hour. He was thorough, patient, and dedicated. He kept detailed notes on every case: the signs and symptoms, the individual, their family, and their living conditions. All these would be compiled, every evening, in tables and reports. Through his diligent work he championed many health policy initiatives, which later were adopted by the whole country.

Nicolas Fiorini: I wanted to stop studying after receiving my bachelor’s degree in biology, but a teacher suggested volunteering for an unpaid internship in a lab during the summer and using the time to see what I wanted to do next.

I did that and went on to get a master’s in bioinformatics, followed by an internship at the European Bioinformatics Institute (the European counterpart to our NCBI), and then a PhD in computer science.

Don Comeau: John Wilbur’s wife and my wife worked at the same company. John, who works here at NLM and NCBI, and I became acquainted at our wives’ office parties. Neither of us had anyone else to talk to. His work was fascinating!
How did you get started in your career?
Rezarta Dogan: In 2002, as a graduate student in computer science at the University of Maryland in College Park, I attended a lecture by Teresa Przytycka, PhD, who at the time was a research scientist at Johns Hopkins University. I had already gone through all the available bioinformatics courses and, true to my core, was trying to find the place where computation and analytics met health and medicine.

That day was instrumental because, instead of following the easy path of picking one of the research projects that my advisor was interested in, I expressed my strong interest in this interdisciplinary research area. The rest is history.

Nicolas Fiorini: I’ve always been attracted to NLM, NCBI, and PubMed. I kept hearing about them during my biology studies and later in bioinformatics. They were, and still are for me, the best place to go for bioinformatics.

While studying for my PhD in computer science, I tried to find biomedical use cases to illustrate my problems while aiming at a postdoc at NLM.

Don Comeau: Which one: computational chemist, computer science professor, or now natural language processing (NLP) researcher? My work in NLP began in 2000 when John Wilbur hired me to work in his research group.
What really gets you jazzed about science and research?
Rezarta Dogan: The good that comes out of it. The more significant the project outcome, the more restless I become until I get to the bottom of it.

Nicolas Fiorini: I love coding but I did not want to make it my job. I love theoretical research, but I could not spend 100% of my time doing it.

What really excites me is that I can code, do more theoretical research, and know that everything I come up with can potentially be integrated into a PubMed portal, thus helping a lot of people.

Don Comeau: I’ve always been too curious for my own good. Any chance to learn something new is a win. As a kid, I loved books about how things worked. My kids and I enjoyed reading the adventures of Curious George.
If you weren’t doing this work, what other profession might you have pursued?
Rezarta Dogan: I see myself at a university teaching and guiding research projects on data and information science, data privacy, biomedical information processing, or natural language processing.
Nicolas Fiorini: I think the most likely would have been web developer. I love following the new web technologies.
Don Comeau: All of my careers have involved programming.
Tell us something surprising about yourself.
Rezarta Dogan: I love cooking. I make my own bread. You can always find me at local farm markets picking out fresh vegetables. My dream is a big kitchen and a big dining room.
Nicolas Fiorini: I won the 2014 Abalone World Championship and I plan to participate again in 2017. I also had Slash’s haircut a few years ago. (Slash is a hard-rock guitarist with hair you’d have to see to believe.)
Don Comeau: I played bassoon up through college. Then life got busy. I dusted the bassoon off for my kid’s high school production of “The Sound of Music.”


Quick Q&A with Alan Hsu, Won Gyu Kim, and Chih-Hsuan Wei
Question | Alan Hsu, PhD | Won Gyu Kim, PhD | Chih-Hsuan Wei, PhD
What is the focus of your NLM research and why is it significant?
Alan Hsu: My research focuses on biomedical text mining that applies machine learning to search engine optimization and automatic extraction of biomedical named entities. This type of research helps users improve searches for related works.

Won Gyu Kim: Machine learning and text mining research for improving the quality of databases at NCBI.

Chih-Hsuan Wei: My research focuses on bioconcept recognition in text.

I develop systems and web application programming interfaces (APIs) that allow users to access our systems and generate automatic annotation results. Most text mining research, and some other bioinformatics research fields, are highly dependent on bioconcept annotations.

What or who inspired you to pursue your career?
Alan Hsu: I love watching Sci-Fi movies, and there is a friend “Johnny Five” (the JavaScript Robotics & IoT Platform) that inspires me to explore machine learning.
Won Gyu Kim: Richard P. Feynman (1918-1988), scientist, teacher, raconteur, and musician.
Chih-Hsuan Wei: Many people inspired me, especially my supervisor and advisors.
How did you get started in your career?
Alan Hsu: I participated in several competitions held by NCBI that had a great impact on my previous research. Therefore, I decided to do my postdoctoral research at NCBI.
Won Gyu Kim: I started with a postdoc fellowship at NCBI in 1998.
Chih-Hsuan Wei: I started my research during my PhD studies.
What really gets you jazzed about science and research?
Alan Hsu: It is a great pleasure to design useful systems and assist people in understanding and discovering new knowledge. It is also an exciting challenge to develop interdisciplinary research in collaboration with life science.
Won Gyu Kim: It does not lie.
Chih-Hsuan Wei: Natural language processing is amazing. How to teach a computer to extract and retrieve the information and knowledge in the text is a challenging and exciting task.
If you weren’t doing this work, what other profession might you have pursued?
Alan Hsu: I would like to start a business analyzing sports data and open government data.
Won Gyu Kim: Chef.
Chih-Hsuan Wei: Professor or teacher.
Tell us something surprising about yourself.
Alan Hsu: When I was a child, I used to play erhu, piano, and recorder.
Won Gyu Kim: I have two daughters who will be much taller than me.
Chih-Hsuan Wei: I wasn’t good at studying during my teenage and college years, but I was, and still am, interested in programming. That’s the reason I am here.