On May 3 I gave a post-qualifying talk to update the department on how my research was going. It covered work done in collaboration with my advisor, Joe Blitzstein, on respondent-driven sampling (RDS), a process used to collect data from hard-to-reach populations, for example injection drug users or HIV-infected people. RDS is used by public health agencies around the world, and policy decisions are made based on the results, so it is important to be able to carry out reasonable estimation with the data obtained. Here is the abstract for the talk:

Respondent-driven sampling is a process of accessing a hidden population of interest by following links in the network of acquaintances within that population. In our earlier simulation study, we saw that the currently used Heckathorn estimator for RDS is comparable to, and often outperformed by, the simple mean estimator, which ignores the mechanism by which the data are obtained. In addition to performance issues, these estimators and their variants are not designed for more complicated settings, for example when the RDS process explores a network in which acquaintances are connected based on the quantity being surveyed (e.g., people may make friends with individuals from the same income cohort, so if the quantity surveyed is income, the network exhibits homophily). Another consideration is that the existing estimators do not come with measures of variability, and it has been shown that the Bootstrap methods developed for this purpose underestimate it. We attempt to approach the problem from a model-based perspective.

A sample arising from an RDS process on a network with homophily is shown in Figure 1. Observe the pattern of runs of consecutive similar observations followed by jumps. The pink line marks the true population mean for the observations. Our goal is to estimate the true population mean as accurately as possible.
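To get a feel for the runs-and-jumps pattern, here is a minimal Python sketch. It is not the simulation code behind Figure 1. It assumes a toy two-group network in which ties within a group are far more likely than ties between groups, and it replaces the RDS recruitment process with a simple random walk (one recruit per respondent, with replacement). The function names, group means, and edge probabilities are all made up for illustration.

```python
import random

def make_homophilous_network(n_per_group=100, p_within=0.10,
                             p_between=0.005, seed=0):
    """Toy network: two groups, edges much more likely within a group."""
    rng = random.Random(seed)
    group = [0] * n_per_group + [1] * n_per_group
    # Surveyed quantity: group 0 is centered at 0, group 1 at 5 (assumed values).
    value = [rng.gauss(5.0 * g, 1.0) for g in group]
    n = len(group)
    neighbors = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            p = p_within if group[i] == group[j] else p_between
            if rng.random() < p:
                neighbors[i].append(j)
                neighbors[j].append(i)
    return group, value, neighbors

def random_walk_sample(neighbors, n_steps=200, seed=1):
    """Simplified stand-in for RDS: each respondent recruits one random neighbor."""
    rng = random.Random(seed)
    connected = [i for i, nb in enumerate(neighbors) if nb]
    current = rng.choice(connected)
    visited = []
    for _ in range(n_steps):
        visited.append(current)
        # A node reached through an edge always has at least one neighbor.
        current = rng.choice(neighbors[current])
    return visited

group, value, neighbors = make_homophilous_network()
sample_nodes = random_walk_sample(neighbors)

true_mean = sum(value) / len(value)
sample_mean = sum(value[i] for i in sample_nodes) / len(sample_nodes)
# Count how often consecutive respondents belong to different groups.
switches = sum(group[a] != group[b]
               for a, b in zip(sample_nodes, sample_nodes[1:]))

print("true population mean:", round(true_mean, 2))
print("sample mean:", round(sample_mean, 2))
print("group switches in", len(sample_nodes), "steps:", switches)
```

With these settings the walk tends to stay inside one group for long stretches and only occasionally jumps to the other, so the sample mean can sit far from the population mean depending on where the walk happens to spend its time.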

Our main idea is that similar observations, if they directly follow one another, do not carry an equal amount of information about the population mean. We use this to design a model-based estimation procedure that has better variance estimation properties. This is encouraging because a point estimator is of limited use unless it comes with accurate error bands.
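One standard way to see why runs of similar observations carry less information is to look at the variance of a sample mean under positive autocorrelation. The following is only an illustration and not the model from the talk: it assumes a stationary sequence with marginal variance $\sigma^2$ and lag-$k$ autocorrelation $\rho^k$ (an AR(1)-type dependence), where $\rho$ and $n_{\text{eff}}$ are symbols introduced here.

```latex
\operatorname{Var}(\bar{X}_n)
  = \frac{\sigma^2}{n}\left[1 + 2\sum_{k=1}^{n-1}\left(1 - \frac{k}{n}\right)\rho^{k}\right]
  \approx \frac{\sigma^2}{n}\cdot\frac{1+\rho}{1-\rho}
  \quad (n \text{ large}),
\qquad
n_{\text{eff}} \approx n\,\frac{1-\rho}{1+\rho}.
```

Under this assumption, with $\rho = 0.8$ a sample of 100 consecutive observations is worth roughly 11 independent ones, so treating the observations as independent would make the error bands about three times too narrow.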

In my talk, I went over this logic in more detail and presented first simulation results comparing the coverage rates of 95% intervals for our method and for the Bootstrap method. Because the Bootstrap method underestimates variability, its intervals are too tight and do not cover the true population mean with adequate frequency.
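The link between underestimated variability and poor coverage can be checked with a small experiment. The Python sketch below implements neither our method nor the Bootstrap procedure. It assumes an AR(1) sequence as a crude stand-in for an RDS sample on a network with homophily, builds the usual 95% interval that treats observations as independent, and records how often that interval covers the true mean (zero here). All names and parameter values are hypothetical.

```python
import math
import random

def ar1_series(n, rho, rng):
    """Stationary AR(1) series with mean 0 and unit marginal variance."""
    x = rng.gauss(0.0, 1.0)
    series = [x]
    innovation_sd = math.sqrt(1.0 - rho ** 2)
    for _ in range(n - 1):
        x = rho * x + rng.gauss(0.0, innovation_sd)
        series.append(x)
    return series

def naive_interval(sample):
    """95% normal interval for the mean that ignores dependence."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((v - mean) ** 2 for v in sample) / (n - 1)
    half_width = 1.96 * math.sqrt(var / n)
    return mean - half_width, mean + half_width

def coverage_rate(rho, n=100, n_reps=500, seed=0):
    """Fraction of replications whose naive interval covers the true mean 0."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_reps):
        low, high = naive_interval(ar1_series(n, rho, rng))
        if low <= 0.0 <= high:
            hits += 1
    return hits / n_reps

coverage_independent = coverage_rate(rho=0.0)
coverage_dependent = coverage_rate(rho=0.8)
print("coverage with independent observations:", coverage_independent)
print("coverage with strongly dependent observations:", coverage_dependent)
```

In this toy setting the interval should cover close to the nominal 95% of the time when observations are independent and far less often when they are strongly dependent, which is the same qualitative failure as intervals that are too tight.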

We constructed histograms of coverage rates for 95% intervals obtained by running RDS processes on 100 networks with homophily, shown in Figure 2.

This is great, as these first results show that we may be able to estimate the variability of the estimator for the RDS process better, which would make for better estimation overall. We are now testing the approach further and writing up the material.

The talk went well overall and generated lively discussion. My hope is that the initial results will be confirmed by further simulations and analysis of actual data.

Tags: Anchor Process, estimation, ggplot, Joe, PQ talk, RDS


