in Problems

Problem 5: As unpredictable as the weather

Whether or not you intend to be an SQL programmer (ever, or for the rest of your life), you will spend a large amount of your time in IT modeling data. For now, throw the MVC meaning of model out the window, or if you happen to be on the 17th floor, just put it on the sill, and restrict yourself to the abstract problem of transforming real world data into a something you can read, search, and display.

Credit where credit is due: The idea for this problem comes to you from Martin Schweitzer, a computer scientist at Australia’s Bureau of Meteorology. Martin is one of the quietly competent people at the top end of computer science, and he has done many more favors for Scott and me than just send us a few ideas.

Problem narrative:

Let’s say we are running a weather service and we store weather observations in a “database.” We have several thousand stations around the country recording observations. Some stations record only temperature, some record only rainfall, and some record a combination of both. There are currently about 15-20 variables that we may be interested in, and any station can record any combination of these variables … but not reliably, meaning that the rain gauge may fail without the thermocouple’s simultaneous demise.

Some stations submit daily records (example: daily rainfall), some submit every six hours (i.e., 4/day). Some record every three hours, and some record every minute (example: the automatic weather stations collecting temperature, wind speed, and wind direction). For convenience and reference, we also need to keep some daily values that are submitted by the stations or calculated after the fact, such as max and min temperature each day.

And then there are the derived data: monthly rainfall, as an example.

There are a few obvious options about how we may store these data in tables. We may want to organize by time:

  • tables for 1-minute data
  • tables for hourly data
  • tables for daily data

Or we could go at it by data type:

  • tables for temperature
  • tables for rainfall
  • tables for windspeed and direction

We could also just bail and store everything as a name-value pair (NVP).

Essential questions:

Describe the aspects of the data that you would take into consideration when designing a storage repository.

Yes, we know this is not an easy problem, but the question is not to present your design in fifteen minutes or less, but to catalog the things you would think about; the questions that need an answer; the boundaries that constrain your choice.

What’s good about this problem:

  1. Beyond the fact that data design and analysis are ubiquitous, daily activities, everyone in the world contends with the weather. Rainfall, temperature, and the wind are universals in the human experience.
  2. This problem is adaptable and reusable. There is no “correct” answer, and interviewers can recycle the problem endlessly.
  3. The problem is expandable, meaning that if the candidate is interviewing for a job as a database designer, you can most certainly talk for an hour with this question as a starting point.

Some things to look for in an answer:

I have spent a lot of my time with databases — mostly SQL, but also with no-SQL databases like Cassandra. I would be most interested in where a candidate starts the discussion: Is it with keys and tables? Scale and size of the data? The queries made by users? Programmers who are embedded in OOP may offer an “implementation solution” that prescribes the implementation vehicle. The starting point will tell you a good bit about the candidate’s thinking.

The key feature of this design mini-exercise is the fact that the keys are so numerous. We may want to get the data by station, or by type, or by time interval, or by some compound criterion. There are likely to be several standard views of the data and a big need for ad hoc requests (“What was the highest absolute humidity in Richmond on the 20th of July 2011?”).