Posts Tagged ‘Processing’

Effective way for data scientists to grow impact

Sunday, October 5th, 2014

To get things done, people need to communicate effectively. At school, teachers present to students. In consulting, consultants make PowerPoint slide decks. In research, researchers give presentations and talks to spread their ideas.

When it comes to data scientists, many of us write code (in R, Python, Julia, etc.) to analyze data and inform decisions. To many people, what we do is rocket science.

What is the most effective and easy way to spread our ideas and grow impact? A good answer is interactive visualization. And not just for data scientists, but for anyone working with analytics.

Sure enough, pretty and intuitive graphics are a good way to deliver insight. And with modern technologies, interactive visualization can grow into products, viral marketing campaigns, and journalism pieces.

I have been doing interactive visualization for a while. Below is a visualization I made to explore geographic enrolment patterns of HarvardX. What started as an exploratory project ended up as a product -- an interactive analytics platform we called HarvardX Insights. It ended up on the cover of Campus Technology, and several universities from around the world contacted HarvardX to get the code.

See databit HarvardX Certificates World Map by Sergiy Nesterko on Databits.

And here is something for data scientists -- a visualization of the Hamiltonian Monte Carlo algorithm. I taught it to my students last year during a graduate course on statistical computing and interactive visualization at Harvard Statistics. This visualization was one of several I created for the course together with students.

See databit Hamiltonian (Hybrid) Monte Carlo by Sergiy Nesterko on Databits.

People who work with data increasingly need to acquire and apply creative coding skills in order to put their ideas to work. This helps them get closer to the end user of an analytic insight and avoid possible operational distortions and dead ends along the way. That is why resources that promote and teach creative coding are in high demand among my peer data scientists. I am a big fan of Mike Bostock's Blocks, and of other resources such as Codepen, JSFiddle, and Stack Overflow.

Recently, I have been using and contributing to Databits more and more. Databits is a website for data scientists, data journalists, and other creative coders to share work, connect, and grow impact. I believe that eventually the site will make it possible to follow and learn from peer data scientists and other creative coders in a more targeted way, specifically those focused on producing effective interactive visualization and other cool stuff. For example, I look forward to learning some Processing applications from this guy. In the meantime, I helped put together a simple databit based on Processing.js:

See databit Processing.js Hello World Sketch by Sergiy Nesterko on Databits.

The site also runs Challenges, an initiative aimed at finding meaningful problems for creative data scientists to solve and add to their portfolios. I find this pretty cool.

I look forward to learning new things, finding cool problems, and making the world a better place with data. Now my creative work has a home -- you can check out my creative endeavors and interests on my Databits profile page.

JSM2011, and a final stretch at RDS

Thursday, August 18th, 2011

The Joint Statistical Meetings conference took place in Miami Beach on July 30-August 5. It went very well, and the definite highlight was the keynote lecture by Sir David Cox. Among the other sessions, the following stand out:

  1. A Frequency Domain EM Algorithm to Detect Similar Dynamics in Time Series with Applications to Spike Sorting and Macro-Economics by Georg M. Goerg, a student at CMU Stat. The talk was very enjoyable, and the ideas it conveyed were crisp and exciting. The main one is that a zero-mean time series can be thought of as a histogram by representing it as a frequency distribution. This allows for an elegant non-parametric classification approach that minimizes the KL divergence between observed and simulated frequency histograms.
  2. Large Scale Data at Facebook by Eric Sun from Facebook. Though not groundbreaking, the talk was exciting as it described the work environment at Facebook and the approach taken to getting signals out of massive data. Mostly, curious facts were presented from analyzing the frequencies of word occurrences in user status updates, with the interesting part being the analysis framework developed to do that.
  3. Jointly Modeling Homophily in Networks and Recruitment Patterns in Respondent-Driven Sampling of Networks by my advisor Joe Blitzstein, about our most recent research on model-based estimation for Respondent-Driven Sampling (RDS). The approach we are developing looks set to have several very attractive features in comparison to current estimation techniques, and it is designed for the case of homophily of varying degree. An example is illustrated in Figure 1.

    Figure 1: An example of homophily, with the network plotted over the histogram of the homophily inducing quantity (left), and resulting (normalized) vertex degrees plotted over the same histogram (right).

    We hope to finish the relevant paper soon and open the approach to extensions by the research community.
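The histogram idea from the first talk in the list above can be sketched in a few lines. This is a minimal illustration under my own assumptions, not the EM algorithm from the talk: each series is turned into a normalized periodogram (a frequency histogram that sums to 1), and a new series is assigned to whichever reference series has the closest spectrum in KL divergence. The function names, the `eps` smoothing constant, and the toy sine series are all hypothetical, and all series are assumed to have the same length.

```python
import cmath
import math


def normalized_periodogram(series):
    """Periodogram of a mean-centered series, scaled to sum to 1."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    powers = []
    # Frequency 0 is skipped: it carries no power once the mean is removed.
    for k in range(1, n // 2 + 1):
        coef = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                   for t, x in enumerate(centered))
        powers.append(abs(coef) ** 2)
    total = sum(powers)
    return [p / total for p in powers]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two histograms on the same bins; eps guards log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def classify_by_spectrum(series, references):
    """Return the name of the reference whose spectrum is closest in KL."""
    target = normalized_periodogram(series)
    return min(
        references,
        key=lambda name: kl_divergence(
            target, normalized_periodogram(references[name])
        ),
    )


# Toy data (hypothetical): two reference rhythms and a phase-shifted query.
n = 64
slow = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]
fast = [math.sin(2 * math.pi * 10 * t / n) for t in range(n)]
query = [math.sin(2 * math.pi * 2 * t / n + 0.7) for t in range(n)]

label = classify_by_spectrum(query, {"slow": slow, "fast": fast})
print(label)
```

The query is matched to the slow reference even though it is shifted in time, because the periodogram discards phase and keeps only how power is spread across frequencies.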

During the conference, I also had a chance to finish making a dynamic 3D visualization of a constrained optimization algorithm I developed for In4mation Insights, which is exciting. As for Miami Beach itself, it is a great place to go out and enjoy the good food, sun and beach. JSM2012 will be held in San Diego.

I created the visualization in this post using Processing.

theory.info, a new project

Tuesday, July 12th, 2011


Recently I purchased the domain and created an interactive logo/visualization for Theory Information Analysis, a screenshot of which is presented above. Theory is a new project that I would like to represent applied real-world work, including quantitative consulting and applied research. (more…)

Dynamic visualization, paper supplement 1

Saturday, May 28th, 2011

(more…)

Dynamic visualization, paper supplement 2

Saturday, May 28th, 2011

(more…)

Dynamic visualization of RDS version 2

Sunday, March 27th, 2011

Early this semester, I worked on complementing my visualization of the Respondent-Driven Sampling (RDS) process presented in this post to illustrate its evolution over time. That was how the second version was created, which is displayed here.

Please refer to the earlier post for a detailed description of the main functionality. The second version implements an additional view of the process, which plots the portion of the underlying network discovered by the RDS process over time. To switch to the alternate view at any time, press the change view button. The wide pink horizontal line in the alternate view marks the true population mean. (more…)

Dynamic visualization of RDS

Saturday, December 18th, 2010

The visualization below is the last element of work with my advisor Joe Blitzstein on exploring the Respondent-Driven Sampling (RDS) process via simulation. (more…)

Tradeoffs in estimation under Respondent-Driven Sampling, and Chernoff faces

Wednesday, October 6th, 2010

Recently I have been working hard on finalizing the paper that I am writing with my advisor Joe Blitzstein about estimation under Respondent-Driven Sampling (RDS). Specifically, the paper aims to develop general intuition about how the process works on networks with different topologies, and about what drives the performance of current estimators (or lack thereof).

To do this, we simulated many networks belonging to one of three main types (homophily, rich-gets-richer, and inverse homophily), simulated many RDS processes of different configurations on each, and compared the performance of the well-established Volz-Heckathorn (VH) estimator and the plain vanilla mean as point estimators under each scenario. Among other findings, it turned out that the VH estimator underperforms the plain mean on the considered class of homophily networks, and prevails in some other cases. (more…)
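For readers unfamiliar with the two point estimators, here is a minimal sketch of how they differ on a single sample. It assumes the usual form of the VH estimator, a mean weighted by the inverse of each respondent's reported degree. The function names and the tiny sample are hypothetical and are not taken from our simulations.

```python
def plain_mean(values):
    """Unweighted sample mean of the observed quantity."""
    return sum(values) / len(values)


def vh_estimate(values, degrees):
    """Inverse-degree-weighted mean (assumed form of the VH estimator)."""
    weights = [1.0 / d for d in degrees]
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)


# Hypothetical sample: two high-degree respondents with the trait (1)
# and two low-degree respondents without it (0).
values = [1, 1, 0, 0]
degrees = [10, 10, 2, 2]

print(plain_mean(values))
print(vh_estimate(values, degrees))
```

The weighting pulls the estimate toward low-degree respondents, on the reasoning that well-connected people are more likely to be recruited. Whether that correction helps or hurts depends on the network topology, which is the tradeoff the paper studies.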

Visualizing while on Opening Workshop on Complex Networks at SAMSI

Tuesday, August 31st, 2010

It is now almost the end of my stay here in Research Triangle Park, NC at the Opening Workshop on Complex Networks organized by SAMSI. I presented a poster here on some of my work with Joe Blitzstein on estimation under respondent-driven sampling. This was about simulation studies we have done to lay foundations for our development of the new estimation method as outlined in this post. I will prepare a post describing this earlier work once we submit a paper on it, which should be soon. I also had the pleasure of meeting other researchers working in the field, in particular Matt Salganik and Erik Volz. It was really enjoyable and inspiring to discuss problems relevant to estimation in RDS.

Apart from enjoying the workshop, I have had a chance to enjoy some Processing and experiment with some ideas about visualizing high dimensional dependent data (that is, when the number of dimensions is larger than 3). (more…)

Networks with homophily, an interesting visualization

Tuesday, June 8th, 2010

The research I am currently involved in with my advisor Joe Blitzstein concerns networks with homophily. As per Wiki:

Homophily (i.e., love of the same) is the tendency of individuals to associate and bond with similar others. The presence of homophily has been discovered in a vast array of network studies. (more…)