I recently attended an interview where I was asked "write a program to find 100 largest numbers out of an array of 1 billion numbers."

I was only able to give a brute force solution which was to sort the array in O(nlogn) time complexity and take the last 100 numbers.

```
Arrays.sort(array);
```

The interviewer was looking for a better time complexity. I tried a couple of other solutions but could not come up with one. Is there a solution with better time complexity?

You can keep a priority queue of the 100 biggest numbers and iterate through the billion numbers. Whenever you encounter a number greater than the smallest number in the queue (the head of the queue), remove the head of the queue and add the new number to the queue.
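As a minimal sketch of the idea above, here is one way it could look in Python using the standard `heapq` module. The function name `top_k` and its parameters are illustrative, not part of the original answer.

```python
import heapq

def top_k(numbers, k=100):
    """Return the k largest values of an iterable, largest first."""
    heap = []  # min-heap: heap[0] is always the smallest value kept so far
    for x in numbers:
        if len(heap) < k:
            heapq.heappush(heap, x)       # still filling the queue
        elif x > heap[0]:
            heapq.heapreplace(heap, x)    # drop the head, insert the new number
    return sorted(heap, reverse=True)
```

Because only the `k` kept values are held in memory, the input can be a generator or a file reader rather than a full array.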

EDIT: As Dev noted, with a priority queue implemented with a heap, the complexity of insertion into the queue is `O(log K)` for a queue holding K elements.

In the worst case you get `billion * log2(100)`, which is better than `billion * log2(billion)`.

In general, if you need the largest K numbers from a set of N numbers, the complexity is `O(N log K)` rather than `O(N log N)`; this can be very significant when K is very small compared to N.

EDIT2: The expected time of this algorithm is pretty interesting, since in each iteration an insertion may or may not occur. The probability of the i'th number being inserted into the queue is the probability of a random variable being larger than at least `i-K` random variables from the same distribution (the first K numbers are automatically added to the queue). We can use order statistics (see link) to calculate this probability. For example, let's assume the numbers were selected uniformly at random from `[0, 1]`: the expected value of the (i-K)th number (out of i numbers) is roughly `(i-K)/i`, and the chance of a random variable being larger than this value is `1-[(i-K)/i] = K/i`.

Thus, the expected number of insertions is:

$$\sum_{i=K+1}^{N} \frac{K}{i} = K\,(H_N - H_K) \approx K \ln\frac{N}{K}$$

where $H_m$ is the m-th harmonic number. And the expected running time can be expressed as:

$$K + (N-K) + \frac{\log K}{2} \sum_{i=K+1}^{N} \frac{K}{i}$$

(`K` time to generate the queue with the first `K` elements, then `N-K` comparisons, and the expected number of insertions as described above, each taking `log(K)/2` time on average.)

Note that when `N` is very large compared to `K`, this expression is a lot closer to `N` than to `N log K`. This is somewhat intuitive: in the case of the question, even after 10000 iterations (which is very small compared to a billion), the chance of a number being inserted into the queue is very small.

If this is asked in an interview, I think the interviewer probably wants to see your problem solving process, not just your knowledge of algorithms.

The description is quite general, so maybe you can ask him the range or meaning of these numbers to make the problem clear. Doing this may impress an interviewer. If, for example, these numbers stand for the ages of people within a country (e.g. China), then it's a much easier problem. With the reasonable assumption that nobody alive is older than 200, you can use an int array of size 200 (maybe 201) to count the number of people with the same age in just one iteration. Here the index means the age. After this it's a piece of cake to find the 100 largest numbers. By the way, this algorithm is called counting sort.

Anyway, making the question more specific and clearer is good for you in an interview.
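The counting idea can be sketched as follows, assuming the values are non-negative integers no larger than some known bound. The name `top_k_bounded` and the `max_value` parameter are made up for this illustration.

```python
def top_k_bounded(values, k=100, max_value=200):
    """Top k of small non-negative integers via a single counting pass."""
    counts = [0] * (max_value + 1)        # index = value (e.g. an age)
    for v in values:
        counts[v] += 1
    result = []
    for v in range(max_value, -1, -1):    # walk from the largest value down
        take = min(counts[v], k - len(result))
        result.extend([v] * take)
        if len(result) == k:
            break
    return result
```

The pass over the input is O(N) and the final walk is O(max_value + k), independent of N.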

You can iterate over the numbers which takes O(n)

Whenever you find a value greater than the current minimum, add the new value to a circular queue with size 100.

The min of that circular queue is your new comparison value. Keep on adding to that queue. If full, extract the minimum from the queue.

I realized that this is tagged with 'algorithm', but will toss out some other options, since it probably should also be tagged 'interview'.

What is the source of the 1 billion numbers? If it is a database then 'select value from table order by value desc limit 100' would do the job quite nicely - there might be dialect differences.

Is this a one-off, or something that will be repeated? If repeated, how frequently? If it is a one-off and the data are in a file, then 'cat srcfile | sort (options as needed) | head -100' will have you quickly doing productive work that you are getting paid to do while the computer handles this trivial chore.

If it is repeated, I would advise picking any decent approach to get the initial answer and then storing / caching the results so that you can continuously report the top 100.

Finally, there is this consideration. Are you looking for an entry level job and interviewing with a geeky manager or future co-worker? If so, then you can toss out all manner of approaches describing the relative technical pros and cons. If you are looking for a more managerial job, then approach it like a manager would, concerned with the development and maintenance costs of the solution, and say "thank you very much" and leave if the interviewer wants to focus on CS trivia. He and you would be unlikely to have much advancement potential there.

Better luck on the next interview.

My immediate reaction for this would be to use a heap, but there is a way to use QuickSelect without keeping all of the input values on hand at any one time.

Create an array of size 200 and fill it up with the first 200 input values. Run QuickSelect and discard the low 100, leaving you with 100 free places. Read in the next 100 input values and run QuickSelect again. Continue until you have run through the entire input in batches of 100.

At the end you have the top 100 values. For N values you have run QuickSelect roughly N/100 times. Each QuickSelect costs about 200 times some constant, so the total cost is 2N times some constant. This looks linear in the size of the input to me, regardless of the parameter size that I am hardwiring to be 100 in this explanation.
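A rough sketch of this batching scheme, assuming a randomized in-place quickselect; `keep_largest` and `top_k_batched` are hypothetical helper names, and the pivot choice is only one of several reasonable options.

```python
import random

def keep_largest(buf, k):
    """Rearrange buf in place so that its k largest values sit in buf[:k]."""
    lo, hi = 0, len(buf) - 1
    while lo < hi:
        pivot = buf[random.randint(lo, hi)]
        i, j = lo, hi
        while i <= j:                     # Hoare-style partition, descending
            while buf[i] > pivot:
                i += 1
            while buf[j] < pivot:
                j -= 1
            if i <= j:
                buf[i], buf[j] = buf[j], buf[i]
                i += 1
                j -= 1
        if k - 1 <= j:                    # boundary lies in the left part
            hi = j
        elif k - 1 >= i:                  # boundary lies in the right part
            lo = i
        else:                             # boundary sits on values equal to pivot
            break

def top_k_batched(stream, k=100):
    """Keep a buffer of at most 2k values; trim it to k whenever it fills."""
    buf = []
    for x in stream:
        buf.append(x)
        if len(buf) == 2 * k:
            keep_largest(buf, k)
            del buf[k:]                   # discard the low half
    if len(buf) > k:
        keep_largest(buf, k)
        del buf[k:]
    return sorted(buf, reverse=True)
```

Only `2k` values are ever held at once, which matches the "array of size 200" in the description.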

You can use the Quickselect algorithm to find the number at index [billion-101] (in sorted order) and then iterate over the numbers to find those that are bigger than that number.

The running time of this algorithm is 2 × O(N) = O(N) (average-case performance).

The second option, as Thomas Jungblut suggests, is to use a heap: building the max-heap takes O(N), and the top 100 numbers can then be extracted from the top of the heap one by one (100 × O(log N)).

The running time of this algorithm is O(N) + 100 × O(log N) = O(N).

Although the other quickselect solution has been downvoted, the fact remains that quickselect will find the solution faster than using a queue of size 100. Quickselect has an expected running time of 2n + o(n), in terms of comparisons. A very simple implementation would be to run quickselect to find the 100th largest value and then scan the array once more, collecting every value larger than it.
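The select-then-scan idea might be sketched like this. It is an illustration rather than the answerer's code: this quickselect copies sublists instead of partitioning in place, so it does not achieve the exact comparison counts quoted here, and the names `kth_largest` and `top_k_by_threshold` are invented for the example.

```python
import random

def kth_largest(values, k):
    """Return the k-th largest value (1-based); expected linear time."""
    items = list(values)
    while True:
        pivot = random.choice(items)
        greater = [x for x in items if x > pivot]
        equal = sum(1 for x in items if x == pivot)
        if k <= len(greater):
            items = greater                       # answer is among larger values
        elif k <= len(greater) + equal:
            return pivot                          # pivot itself is the answer
        else:
            k -= len(greater) + equal             # skip everything >= pivot
            items = [x for x in items if x < pivot]

def top_k_by_threshold(values, k=100):
    """Find the k-th largest value r, then collect everything above it."""
    r = kth_largest(values, k)
    result = [x for x in values if x > r]         # one extra pass over the input
    result += [r] * (k - len(result))             # ties with r fill the rest
    return result
```

The result is not sorted; sort the returned values if an ordering is needed. The input is assumed to hold at least `k` values.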

This will take 3n + o(n) comparisons on average. Moreover, it can be made more efficient using the fact that quickselect will leave the largest 100 items in the array in the 100 right-most locations. So in fact, the running time can be improved to 2n+o(n).

There is the issue that this is expected running time, and not worst case, but by using a decent pivot selection strategy (e.g. pick 21 elements at random, and choose the median of those 21 as pivot), then the number of comparisons can be guaranteed with high probability to be at most (2+c)n for an arbitrarily small constant c.

In fact, by using an optimized sampling strategy (e.g. sample sqrt(n) elements at random, and choose the 99th percentile), the running time can be gotten down to (1+c)n + o(n) for arbitrarily small c (assuming that K, the number of elements to be selected is o(n)).

On the other hand, using a queue of size 100 will require O(log(100)n) comparisons, and log base 2 of 100 is approximately equal to 6.6.

If we think of this problem in the more abstract sense of choosing the largest K elements from an array of size N, where K=o(N) but both K and N go to infinity, then the running time of the quickselect version will be O(N) and the queue version will be O(N log K), so in this sense quickselect is also asymptotically superior.

In comments, it was mentioned that the queue solution will run in expected time N + K log N on a random input. Of course, the random input assumption is never valid unless the question states it explicitly. The queue solution could be made to traverse the array in a random order, but this will incur the additional cost of N calls to a random number generator as well as either permuting the entire input array or else allocating a new array of length N containing the random indices.

If the problem doesn't allow you to move around the elements in the original array, and the cost of allocating memory is high so duplicating the array is not an option, that is a different matter. But strictly in terms of running time, this is the best solution.

Take the first 100 numbers of the billion and sort them. Now just iterate through the rest of the billion; if the source number is higher than the smallest of the 100, insert it in sort order and drop the smallest. What you end up with is something much closer to O(n) over the size of the set.

Two options:

(1) Heap (priorityQueue)

Maintain a min-heap of size 100. Traverse the array. Whenever an element is larger than the first (smallest) element in the heap, replace that element with it.

(2) Map-reduce model.

This is very similar to the word count example in Hadoop. Map job: count every element's frequency. Reduce: get the top K elements.

Usually, I would give the recruiter two answers and let them pick whichever they like. Of course, coding the map-reduce version would be laborious because you have to know every exact parameter. There is no harm in practicing it. Good luck.

A very easy solution would be to iterate through the array 100 times, which is `O(n)` (100 passes over n elements, with 100 treated as a constant). Each time you pull out the largest number (and change its value to the minimum value, so that you don't see it in the next iteration, or keep track of the indexes of previous answers; by keeping track of indexes, the original array can contain the same number multiple times). After 100 iterations, you have the 100 largest numbers.
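A small sketch of the repeated-pass approach, using the index-tracking variant so that the input is not modified. The function name `top_k_by_passes` is illustrative only.

```python
def top_k_by_passes(values, k=100):
    """Run k full passes, each picking the largest value not yet taken."""
    taken = set()                          # indexes already reported
    result = []
    for _ in range(min(k, len(values))):
        best = None
        for i, v in enumerate(values):
            if i not in taken and (best is None or v > values[best]):
                best = i
        taken.add(best)
        result.append(values[best])
    return result                          # largest first
```

This performs about `k * n` comparisons, so for a billion numbers it does noticeably more work than the heap-based approaches, even though it is linear in `n` for fixed `k`.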

Inspired by @ron teller's answer, here is a barebones C program to do what you want.

On my machine (Core i3 with a fast SSD) it takes 25 seconds and 1724 sorts. I generated a binary file with `dd if=/dev/urandom count=1000000000 bs=1` for this run.

Obviously, there are performance issues with reading only 4 bytes at a time from disk, but this is for example's sake. On the plus side, very little memory is needed.
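A rough Python sketch of the same streaming idea (this is not the C program referred to above). It assumes the file holds unsigned 32-bit little-endian integers and reads them in larger chunks instead of 4 bytes at a time; `top_k_from_file` and `chunk_items` are hypothetical names.

```python
import heapq
import struct
from itertools import chain

def top_k_from_file(path, k=100, chunk_items=65536):
    """Stream a binary file of uint32 values, keeping only the k largest."""
    best = []
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk_items * 4)
            count = len(data) // 4                 # ignore a trailing partial int
            if count == 0:
                break
            # Assumption: little-endian unsigned 32-bit integers.
            values = struct.unpack(f"<{count}I", data[:count * 4])
            best = heapq.nlargest(k, chain(best, values))
    return best                                    # largest first
```

Memory use stays bounded by the chunk size plus the `k` retained values, regardless of the file size.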

The simplest solution is to scan the large billion-number array and hold the 100 largest values found so far in a small array buffer without any sorting, remembering the smallest value in this buffer. At first I thought this method was proposed by fordprefect, but in a comment he said that he assumed the 100-number data structure would be implemented as a heap. Whenever a new number is found that is larger than the minimum, the minimum in the buffer is overwritten by the new value and the buffer is searched for its current minimum again. If the numbers in the billion-number array are randomly distributed, then most of the time the value from the large array is compared to the minimum of the small array and discarded. Only for a very small fraction of the numbers must the value be inserted into the small array. So the cost of manipulating the small data structure can be neglected. For a small number of elements it is hard to determine whether using a priority queue is actually faster than my naive approach.

I want to estimate the number of inserts into the small 100-element buffer when the 10^9-element array is scanned. The program scans the first 1000 elements of this large array and has to insert at most 1000 elements into the buffer. The buffer then contains 100 of the 1000 elements scanned, that is 0.1 of the elements scanned. So we assume that the probability that a value from the large array is larger than the current minimum of the buffer is about 0.1. Such an element has to be inserted into the buffer.

Now the program scans the next 10^4 elements from the large array. We estimated that the ratio of elements larger than our current minimum is about 0.1, so there are 0.1*10^4=1000 elements to insert. Actually the expected number of elements that are inserted into the buffer will be smaller, because the minimum of the buffer increases every time a new element is inserted. After the scan of these 10^4 elements, the numbers in the buffer will be about 0.01 of the elements scanned so far. So when scanning the next 10^5 numbers we assume that no more than 0.01*10^5=1000 will be inserted into the buffer. Continuing this argument, we have inserted about 7000 values after scanning 1000+10^4+10^5+...+10^9 ~ 10^9 elements of the large array. So when scanning an array with 10^9 elements of random values we expect no more than 10^4 (=7000 rounded up) insertions into the buffer.

After each insertion into the buffer the new minimum must be found. If the buffer is a simple array we need 100 comparisons to find the new minimum. If the buffer is another data structure (like a heap) we need at least 1 comparison to find the minimum. To compare the elements of the large array we need 10^9 comparisons. So all in all we need about 10^9+100*10^4=1.001 * 10^9 comparisons when using an array as buffer and at least 1.000 * 10^9 comparisons when using another type of data structure (like a heap).

So using a heap brings only a gain of 0.1% if performance is determined by the number of comparisons. But what is the difference in execution time between inserting an element into a 100-element heap and replacing an element in a 100-element array and finding its new minimum?

At the theoretical level: How many comparisons are needed for inserting into a heap? I know it is O(log(n)), but how large is the constant factor?

At the machine level: What is the impact of caching and branch prediction on the execution time of a heap insert and a linear search in an array?

At the implementation level: What additional costs are hidden in a heap data structure supplied by a library or a compiler?

I think these are some of the questions that have to be answered before one can try to estimate the real difference between the performance of a 100 element heap or a 100 element array. So it would make sense to make an experiment and measure the real performance.
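As a starting point for such an experiment, here is a minimal sketch of the plain array buffer that also counts how often the buffer is modified. The function name and the returned counter are assumptions made for this illustration; timing it against a heap-based version is left to the reader.

```python
def top_k_array_buffer(values, k=100):
    """Unsorted buffer of the k largest values; also counts replacements."""
    buf = []
    min_pos = 0            # index of the smallest value in buf (once full)
    replacements = 0
    for x in values:
        if len(buf) < k:
            buf.append(x)
            if len(buf) == k:
                min_pos = min(range(k), key=buf.__getitem__)
        elif x > buf[min_pos]:
            buf[min_pos] = x                               # overwrite the minimum
            replacements += 1
            min_pos = min(range(k), key=buf.__getitem__)   # linear rescan
    return buf, replacements
```

On input sorted in ascending order every element after the first `k` triggers a replacement, which is the worst case for this scheme; on randomly ordered input the counter stays far smaller.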

Algorithm: biggest x elements from n

I will call the return value LIST. It is a set of x elements (in my opinion that should be a linked list).

So, what is the worst case scenario?

x log(x) + (n-x)(log(x)+1) = nlog(x) + n - x

So that is O(n log(x)) time for the worst case, which is O(n) when x is a fixed constant such as 100. The +1 is the check of whether the number is greater than the smallest one in LIST. The expected time for the average case will depend on the mathematical distribution of those n elements.

Possible improvements

This algorithm can be slightly improved for the worst case scenario, but IMHO (I cannot prove this claim) that will degrade average behavior. Asymptotic behavior will be the same.

Improvement in this algorithm will be that we will not check if element is greater than smallest. For each element we will try to insert it and if it is smaller than smallest we will disregard it. Although that sounds preposterous if we regard only the worst case scenario we will have

x log(x) + (n-x)log(x) = nlog(x)

operations.

For this use case I don't see any further improvements. Yet you must ask yourself - what if I have to do this more than log(n) times and for different x-es? Obviously we would sort that array in O(n log(n)) and take our x element whenever we need them.

This question would be answered with N log(100) complexity (instead of N log N) with just one line of C++ code.

The final answer would be a vector where the first 100 elements are guaranteed to be the 100 biggest numbers of your array, while the remaining elements are unordered.

C++ STL (standard library) is quite handy for this kind of problems.

Note: I am not saying that this is the optimal solution, but it would have saved your interview.
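For comparison, Python's standard library offers a similarly compact route. The sketch below relies on `heapq.nlargest`, which returns the requested number of largest items in descending order; the wrapper name `largest_100` is made up here.

```python
import heapq

def largest_100(numbers):
    """Return the 100 largest values of an iterable, largest first."""
    return heapq.nlargest(100, numbers)
```

Unlike the in-place C++ approach described above, this returns a new list and leaves the input untouched.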

The simple solution would be using a priority queue, adding the first 100 numbers to the queue and keeping track of the smallest number in the queue, then iterating through the other billion numbers, and each time we find one that is larger than the smallest number in the priority queue, we remove the smallest number, add the new number, and again keep track of the smallest number in the queue.

If the numbers were in random order, this would work beautifully because as we iterate through a billion random numbers, it would be very rare that the next number is among the 100 largest so far. But the numbers might not be random. If the array was already sorted in ascending order then we would always insert an element into the priority queue.

So we pick, say, 100,000 random numbers from the array first. To avoid random access, which might be slow, we add, say, 400 random groups of 250 consecutive numbers. With that random selection, we can be quite sure that very few of the remaining numbers are in the top hundred, so the execution time will be very close to that of a simple loop comparing a billion numbers to some maximum value.

Finding the top 100 out of a billion numbers is best done using a min-heap of 100 elements.

First prime the min-heap with the first 100 numbers encountered. min-heap will store the smallest of the first 100 numbers at the root (top).

Now as you go along the rest of the numbers only compare them with the root (smallest of the 100).

If the new number encountered is larger than root of min-heap replace the root with that number otherwise ignore it.

As part of the insertion of the new number in min-heap the smallest number in the heap will come to the top (root).

Once we have gone through all the numbers we will have the largest 100 numbers in the min-heap.

I have written up a simple solution in Python in case anyone is interested. It uses the `bisect` module and a temporary return list which it keeps sorted. This is similar to a priority queue implementation.

Usage with 100,000,000 elements and worst-case input which is a sorted list:

It took about 40 seconds to calculate this for 100,000,000 elements so I'm scared to do it for 1 billion. To be fair though, I was feeding it the worst-case input (ironically an array that is already sorted).
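A minimal sketch of what a `bisect`-based version could look like, assuming a small list kept in ascending order. This is an illustration of the approach described, not the answerer's original listing, and `top_k_sorted_list` is an invented name.

```python
import bisect

def top_k_sorted_list(values, k=100):
    """Keep the k largest values seen so far in an ascending sorted list."""
    best = []
    for x in values:
        if len(best) < k:
            bisect.insort(best, x)        # still filling the list
        elif x > best[0]:
            bisect.insort(best, x)        # insert in sorted position
            del best[0]                   # drop the previous smallest
    return best                           # ascending order
```

Each insertion shifts list elements, so a sorted input (where every element is inserted) is the slow case, consistent with the timing remark above.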

I see a lot of O(N) discussions, so I propose something different just for the thought exercise.

Is there any known information about the nature of these numbers? If it's random in nature, then go no further and look at the other answers. You won't get any better results than they do.

However! See if whatever list-populating mechanism populated that list in a particular order. Are they in a well-defined pattern where you can know with certainty that the largest magnitude of numbers will be found in a certain region of the list or on a certain interval? There may be a pattern to it. If that is so, for example if they are guaranteed to be in some sort of normal distribution with the characteristic hump in the middle, always have repeating upward trends among defined subsets, have a prolonged spike at some time T in the middle of the data set like perhaps an incidence of insider trading or equipment failure, or maybe just have a "spike" every Nth number as in analysis of forces after a catastrophe, you can reduce the number of records you have to check significantly.

There's some food for thought anyway. Maybe this will help you give future interviewers a thoughtful answer. I know I would be impressed if someone asked me such a question in response to a problem like this - it would tell me that they are thinking of optimization. Just recognize that there may not always be a possibility to optimize.

Create a list of 100 empty slots

For every number in input-list:

If the number is smaller than the first one in the list, skip it

Otherwise replace the first one with this number

Then, push the number through adjacent swaps until it's smaller than the next one

Return the list

Note: if `log(input-list.size) + c < 100`, then the optimal way is to sort the input-list, then split off the first 100 items.

The complexity is O(N).

First create an array of 100 ints and initialize the first element of this array to the first element of the N values. Keep track of the index of the current element with another variable; call it CurrentBig.

Iterate through the N values

When done, print the M array from CurrentBig, 100 times modulo 100 :-) For the student: make sure that the last line of the code does not trump valid data right before the code exits.

Another O(n) algorithm -

The algorithm finds the largest 100 by elimination

Consider all the billion numbers in their binary representation. Start from the most significant bit. Finding whether the MSB is 1 can be done by a boolean multiplication (bitwise AND) with an appropriate mask. If there are more than 100 1's among these numbers, eliminate the numbers whose bit is zero. Now, for the remaining numbers, proceed with the next most significant bit. Keep a count of the remaining numbers after elimination and proceed as long as this count is greater than 100.

The major boolean operation can be done in parallel on GPUs.
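The bit-by-bit elimination can be sketched as below for non-negative integers of a known width. One detail is an assumption added for the sketch: when fewer candidates than needed have the current bit set, those are accepted outright and the search continues among the rest. The name `top_k_by_bits` and the `bits` parameter are illustrative.

```python
def top_k_by_bits(values, k=100, bits=32):
    """Select the k largest non-negative ints by eliminating on each bit."""
    candidates = list(values)
    result = []                                 # values already known to be in the top k
    for bit in range(bits - 1, -1, -1):
        need = k - len(result)
        if len(candidates) <= need:
            break                               # every remaining candidate is needed
        mask = 1 << bit
        ones = [x for x in candidates if x & mask]
        if len(ones) >= need:
            candidates = ones                   # eliminate numbers with a zero bit
        else:
            result.extend(ones)                 # assumption: accept these, keep searching
            candidates = [x for x in candidates if not x & mask]
    result.extend(candidates[:k - len(result)])
    return result                               # not sorted
```

Each pass touches only the surviving candidates, so the total work is bounded by `bits * n` bit tests and is often much less, since the candidate set can shrink quickly.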

I would find out who had the time to put a billion numbers into an array and fire him. Must work for government. At least if you had a linked list you could insert a number into the middle without moving half a billion to make room. Even better a Btree allows for a binary search. Each comparison eliminates half of your total. A hash algorithm would allow you to populate the data structure like a checkerboard but not so good for sparse data. As it is your best bet is to have a solution array of 100 integers and keep track of the lowest number in your solution array so you can replace it when you come across a higher number in the original array. You would have to look at every element in the original array assuming it is not sorted to begin with.

You can do it in `O(n)` time. Just iterate through the list and keep track of the 100 biggest numbers you've seen at any given point and the minimum value in that group. When you find a new number bigger than the smallest of your 100, replace it and update your new min value of the 100 (this may take a constant time of 100 to determine each time you do it, but this does not affect the overall analysis).

Please note esp. the second step might be easy to compute in parallel! And it will also be efficient when you need a million biggest elements.

It's a question from Google or some other industry giant. Maybe the following code is the right answer expected by your interviewer. The time cost and space cost depend on the maximum number in the input array. For a 32-bit int array input, the maximum space cost is 4 * 125M bytes and the time cost is 5 * billion.

I wrote my own code; not sure if it's what the interviewer is looking for.

Possible improvements.

If the file contains 1 billion numbers, reading it could be really long...

To improve this you can:

First take 1000 elements and add them in a max heap. Now take out the first max 100 elements and store it somewhere. Now pick next 900 elements from the file and add them in the heap along with the last 100 highest element.

Keep repeating this process of picking up 100 elements from the heap and adding 900 elements from the file.

The final pick of 100 elements will give us the maximum 100 elements from a billion of numbers.

This code is for finding the N largest numbers in an unsorted array.

This might not be the most efficient one, but it gets the job done.

Hope this helps

I know this might get buried, but here is my idea for a variation on a `radix MSD`.

`pseudo-code:`

The function `getMsdIdx(int num)` would return the index of the most significant digit (non-zero). The function `getMsd(int num)` would return the most significant digit. The function `removeMSD(int num)` would remove the most significant digit from a number and return the number (or return null if there was nothing left after removing the most significant digit).

Once this is done, all that is left is traversing `mynums` to grab the top 100 digits. This would be something like:

I should note that although the above looks like it has high time complexity, it will really only be around `O(7*100)`.

A quick explanation of what this is trying to do: Essentially this system is trying to use every digit in a 2d-array based upon the index of the digit in the number, and the digit's value. It uses these as indexes to keep track of how many numbers of that value have been inserted in the array. When 100 has been reached, it closes off all "lower branches".

The time of this algorithm is something like `O(billion*log(16)*7)+O(100)`. I could be wrong about that. Also it is very likely this needs debugging as it is kinda complex and I just wrote it off the top of my head.

EDIT: Downvotes without explanation are not helpful. If you think this answer is incorrect, please leave a comment why. Pretty sure that StackOverflow even tells you to do so when you downvote.