in Problems

Problem 3: Checking the no-fly list

There always seems to be an interviewer who is out to make his mark by asking the big-O question. Not that big-O question, but the one where a data structure is presented with little of the real world information that might accompany its use, and then the question is asked “What’s the big-O notation for the performance of ….” To me it is always clear that the interviewer did not hear a word I said up to that point, and that my pass/fail grade is simply whether or not I recall that it is proportional to x*ln(x) or some such, or whether I can clumsily work it out with a pencil and paper instead of looking it up in Volume 3 of Knuth which is what I would do on the job.

As an interviewer, this type of questioning doesn’t solve the problem of knowing whether the candidate can think, or to paraphrase Joel Spolsky, whether or not you would enjoy working on the same project with the candidate. Instead, we here at! have tried to come up with a useful algorithm and performance problem, very much grounded in the real world.

Problem narrative

The United States maintains a list of between 6000 and 7000 people who are not allowed to fly on commercial aircraft. This is called the No Fly List or NFL for the purpose of this problem. Let us assume that the names are unique, or that duplicate names are handled by a suffix designation that provides the uniqueness. Example: If there are two “GEORGE FLANAGIN” entries, they could be distinguished as “GEORGE FLANAGIN 12/1/1956” and “GEORGE FLANAGIN 4/22/1958”. (And so far, there are zero entries with my name, and I’m still flying!)

Each passenger aircraft has a list of 30 to 700 passengers depending on whether it is a Saab prop-jet or an Airbus A380, or something between. These passengers are identified by the same scheme used for the No Fly List. This list is the Passenger Manifest, or PM.

For the aircraft to be allowed to leave the airport, the following condition must be satisfied:

NFL ∩ PM = ∅

Essential questions

This type of problem frequently comes up in programming tasks, particularly those that involve databases where there are searches for duplicates. Most modern programming languages define an algorithm called set_intersection, and it is best to use it in production code.

But …

  1. Explain why set_intersection is not appropriate for this problem.
  2. Explain how the algorithm works.
  3. What preconditions must exist and be checked for it to operate efficiently?
  4. Give at least one alternate method of accomplishing the same goal, keeping in mind that for this problem’s scenario we need only have one passenger aboard to plane who is on the No Fly List to prevent the aircraft from flying away.

Extra credit

  1. Explain the performance characteristics of both the algorithm and the alternative as the sizes of the lists change, both absolute size and relative to each other.

This problem is about

This is a question where we can consider what constitutes an appropriate response for one’s level of experience. Far too many interview questions are a “figure it out or not” debacle.

We want to politely reject candidates who are clueless, and we want to avoid immediately going for the throat named “optimization.” Who are the inadequate candidates? Perhaps those who suggest checking everything in one list against every value in the other list. These people would tend to write “brute force” code.

A useful response goes something like this:

  1. set_intersection does too much! In fact it does way too much considering that there only needs to be one no-fly passenger on the manifest to keep the airplane on the ground.
  2. At least one of the lists needs to be sorted in order to be able to search it quickly.
  3. Ideally, both lists should be sorted. It doesn’t really matter what they are sorted on, so long as they are each sorted in the same way.
  4. Take the shorter of the lists and look for its first term in the longer list, starting at its beginning. In fact, it doesn’t matter which list you use at this point. We could simply call them “the left list” and “the right list.” If only one list is sorted, we want to look at the unsorted list.
  5. Is this term in the other list? Any time we find a match, we quit.
  6. If not found and both lists are sorted in the same way, take the “next” term in the longer list (the one just greater than the search term) and look for it in the shorter list. Essentially, reverse the context of longer-shorter lists.
  7. Work your way toward the end of both lists.

A number of progressive optimizations are possible, many of them having to do with lower_bound and upper_bound, which is why this is a useful problem for both the tyros and the masters. A person is free to answer with a degree of correctness fitting their experience level.

Not to leave the question unanswered, the problem of list size is most primitively stated this way: If one list is size x and the other is size y, then if you serially read one list and binary search in the other, you are comparing x*ln(y) and y*ln(x) — which means you should really hope it is the longer list that is sorted when you get it.