Guidelines for beginning graduate students on how to do research

I have the great pleasure of being surrounded by amazing scientists. Over the years, I’ve learned quite a few things from my colleagues and from my own mistakes. One of the most important lessons I’ve learned is the skill of developing a research project. We aren’t explicitly taught this as graduate students and I’m hoping to write about it today. This post is specifically written for first/second year graduate students in computational biology and can probably still applicable to other data-driven fields.

I’ve split up the lessons I’ve learned into 5 guidelines.

Guideline 1: Start with a question and a data-set

I find it helpful to start with a specific question and a data-set to start the research. Questions can come from anywhere. For example, your question might have been sparked by a conversation with your colleague. Or, you might have applied a method to a data-set and found something unexpected. Coming up with a “good” question might be difficult for beginning graduate students who do not have an expectation of what is already known/non-trivial and interesting. However, from my personal experience, I find the specific details of the questions to be less important. For example, I find that after looking at the data, my initial question leads me to another more interesting question; or I find that my initial question doesn’t make much sense.

Guideline 2: be in the right place for your question

Your advisor usually has unique areas of expertise and it makes the most sense if your questions align with your advisor’s expertise. For example, it doesn’t make much sense for a graduate student to study the evolution of ground squirrels if they are in a cancer lab. We have mentors for a reason! I’ve seen beginning graduate students choose research areas that are outside their advisor’s expertise. In my experience, these projects tend to progress at a slower rate.

Guideline 3: Exploratory Data Analysis

This is a very important guideline and should be done before developing a model. This often involves visualizing the data many different ways.  From your plots, you will ask more questions, and so on. Exploratory data analysis is an iterative process. Here, critical thinking is well.. critical. You need to carefully look at your plots, isolate interesting patterns… and just keep asking the data questions.

I found that tools such as Rstudio, Snakemake, dplyr, and ggplot2 to be extremely helpful. Basically, these tools allow you to turn your questions into code and plots quite quickly.

As a beginning graduate student, I did not appreciate this guidelines, and sometimes would want to directly jump into the model building. One of my advisors (Matthew Stephens) often told me to “not put the cart before the horse”. It took me a while to appreciate the significance of this saying (I initially thought it was a British thing).

Guideline 4: Start with a simple model, and build on it.

Ever heard of the phrase “eat an elephant one bite a time”? Start with a simple model that fits that patterns you saw in the data (see Guideline 3). And then you can build on it.

Guideline 5: Talk with people

Science should not be done in isolation. I find talking to people to be crucial to the development of my projects. You do not have to talk to senior scientists, but anyone who is interested and willing to listen. It is especially helpful to talk to people who have previously analyzed the type of data you are studying. For example, in ancient dna project, the fundamental problem we were trying to solve became more clear after a discussion with my ancient-dna colleagues. Maintain good relations with your colleagues and communicate often!


There is no algorithm for research. Indeed, every research project is different. I’ve seen a project turn out quite interesting that violated Guideline 3 and 4, such that it started with the method and not the data. However, these guidelines seem to work the best for me. I am interested in other opinions too. What have you found important/helpful for doing research?


Lenski’s bacteria may never converge to the optima

In this post, I hope to write about an interesting theorem I found in the genetic algorithm literature that has interesting implications to evolutionary biology and is probably unknown to evolutionary biologists. The theorem states: the canonical genetic algorithm does not necessarily converge to the global optima. Therefore even in a perfectly constant environment and given an infinite amount of time, a population may still not reach the fitness optima.

One implication of this theorem is that the E-Coli in Richard Lenski’s long term experiment, even carried on to the end of the universe, may never actually converge to the fitness optimum. 

Even given an infinite amount of time, Lenski's Ecoli may never converge to the fitness optima :(

Even given an infinite amount of time, Lenski’s Ecoli may never converge to the fitness optima 😦

Let me first describe the basic mathematical framework.

The theorem comes from, I have copied aggressively from the paper to write this post.

In the canonical genetic algorithm, each individual in the population (of size n) is encoded as a binary tuple (i.e. a sequence of 0 and 1’s) . Proportional selection is used, which means the population from the next generation is determined by randomly sampling from the current generation, like in the Wright-Fisher model. Therefore the probability that individual b_i is selected from the population consisting of individuals (b_1, b_2,..., b_n) to be a member of the next generation is: P\{b_i\} = \frac{f(b_i)}{\sum_{j=1}^{n} f(b_j)} \ge 0.

Mutation operates independently on each individual by probabilistically perturbing each bit. Mutation occurs with probability p_m. For example, the probability that individual b=(0,0,0,0) transitions to b'=10110 by mutation is p_m(1-p_m)p_mp_m(1-p_m)=p_m^k(1-p_m)^{l-k} where l is the length of the string and k is the number of differences between b and b' Cross-over occurs with probability p_c but we skip a description of cross-over since it is not used in the proof of the theorem.

This process can be represented as a markov chain with a finite state space that consists of a population of n individuals where each individual is a bit string of length l. The transition probability is independent from time, therefore the markov chain is said to be homogenous. Thus we can describe the canonical genetic algorithm (and thus evolution) by a finite state homogenous markov chain.

One might point out that in this representations individuals (i..e chromosomes) are not binary in reality but this a matter of doubling the size of our vector. We could represent one loci with four possible values (A,C,G,T) with two loci with two possible values (10, 11, 00, 01). For example, 10 can be represented as A, 11 as G, 00 as C and 01 as T.

Now that we described the basic framework, we outline the steps of the proof.

(1) Let i be any state with less than optimal fitness and let p_i^t denote the probability of reaching that state at time t. Also because of the structure of the transition matrix, this is true: \lim \ t \to \infty \ p_i^t > 0.

(2) P(not reaching the optimal fitness at time t) \ge  p_i^t

(3) step (2) is true if and only if P(reaching the optimal fitness at time t\le  1 - p_i^t

(4) \lim \ t \to \infty P(reaching the optimal fitness at time t) =  \lim \ t \to \infty \ (1 - p_i^t) < 1

Please see the paper for the actual proof.

Genetic Algorithms and their Direct Relevance to Evolution

Genetic algorithms simulate evolution. Wikipedia says that genetic algorithms are “search heuristic that mimic the process of natural selection” but it is much more than that. For example, drift (which is not natural selection) can be seen in a simulation of a genetic algorithm. I actually believe genetic algorithms might be able to reflect nearly every phenomena observed in the natural world. But understanding evolution using genetic algorithms is definitely not straightforward as we do not know how fitness functions actually look like in nature (a long-standing problem in evolutionary biology) and even if we do, we cannot simulate such unimaginable complex fitness functions.

The genetic algorithm goes something like this (although there are many variations):

First choose an initial population of N chromosomes

Repeat the following for n generations

(1) perform crossover  (this is optional)

(2) perform mutation

(3) calculate the fitness of each chromosome (x) in the population using the fitness function f(x)

(4) perform selection

All the essential properties of evolution as understood by biologists are captured in the algorithm: variation, heritability, and differential success in leaving offspring.

I think therefore, principally every phenomenon seen in the natural world could be seen in genetic algorithms of sufficient complexity. But I don’t think the reverse logically follows.

Here I show a simulation of a genetic algorithm using the R package “GA”. The fitness function in this case is the Rastrigin function which has a large number of local optima ( I only used the 1 dimensional version for simplicity.

It looks like this (note the global optima is at 500)



Here is a run of a genetic algorithm



As you can see, it eventually reached the global optima. Note how the best fitness value increased in discrete steps and how the average fitness of the population took sometime to catch up.