Random Lines from a File
Update 2011-07-18: As Ernesto correctly points out in the comments, this algorithm selects N independent random lines from a file.
The following is my python implementation of choosing N random lines with equal probability from a file. This implementation is both memory and time efficient unlike other solutions. This problem was originally brought to my attention after a friend was asked a similar problem at a Google interview. Additionally a few weeks ago I was catching up on Jonathan’s blog and came across his March 2008 post about this problem. I submitted a similar solution to the one below in his comments.
Anyway, a naïve solution to this problem involves calculating a random position in the file, with the file size being the max. However this solution does not give equal probability to each line as longer lines will more likely be selected.
The simple solution involves counting the number of lines in the file and then choosing N random lines. However this requires a scan over the entire file to count the number of lines, and then in the worst case an entire scan over the file to print each random line. This is time inefficient if the file is massive in size.
One optimization is to store the seek position of all the line indexes in memory, thus avoiding the scanning to print out each random line. However this requires storing a number for all the lines in the file, which is memory inefficient if the file has a ton of lines.
By choosing whether to keep or replace the selected lines at each step in a single scan, one simply needs to store N positions into the file thus being both time and memory efficient. I’ll leave the exercise of proving that each line is selected with equal probability up to you.
Edited: Added prev so that the first line would be included, and replaced file.readline()[:-1] with file.readline().strip() to remove trailing whitespace properly.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 | #!/usr/bin/env python import os, random, sys if __name__ == '__main__': def error_exit(msg): sys.stderr.write(msg) sys.exit(1) # Verify arguments if len(sys.argv) != 3: error_exit('Usage: %s FILE NUM_LINES\n' % os.path.basename(sys.argv[0])) text_file = sys.argv[1] if not os.path.isfile(text_file): error_exit('%s does not exist, or is not a file\n' % text_file) try: num_lines = int(sys.argv[2]) except ValueError: error_exit('%s is not a number\n' % sys.argv[2]) seeks = [0 for x in range(num_lines)] file = open(text_file) count = 0 prev = 0 # Calculate Random Lines while file.readline(): for i in range(num_lines): if random.randint(0, count) == 0: seeks[i] = prev prev = file.tell() count += 1 # Print Random Lines for i, pos in enumerate(seeks): file.seek(pos) print '%d: %s' % (i, file.readline().strip()) file.close() |
Related Entries
Comments
Comment from Cheri @ Blog This Mom!
Time 2009/03/23 at 10:31 PM
Heh.
I totally lied.
I have no effing idea what you mean.
But I’m sure you’re right.
Comment from Melissa
Time 2010/12/19 at 10:07 PM
Thanks but this script is producing repeating random lines.
Comment from Bryce Boe
Time 2010/12/20 at 5:48 AM
Melissa,
Unfortunately that is the case because each line selection is independant.
Comment from little miss savage
Time 2011/05/06 at 10:44 AM
this is really useful for my research. Google recommended checking this out since it was from my “social circle” and it was actually a good pick.
Nice coding and good explanation.
Comment from Ernesto
Time 2011/07/18 at 8:47 AM
This algorithm seems to be incorrect. It selects duplicate lines. A true algorithm for this purpose would really select n different lines out of a total of N, and would be unbiased in selecting any of the possible combinations of n out of N.
(When I say different, I mean actual different lines given their line numbers. If two different lines happen to have the same content, they are still considered different).
A correct algorithm form Alan Waterman is exposed in the second volume of Donald Knuth’s “The Art of Computer Programming”. The algorithm even works without knowing in advance the total number of lines of the file. There’s also a simpler algorithm if we know in advance the number of lines, and it is still not obvious as a naive solution would be.
Comment from Bryce Boe
Time 2011/07/18 at 9:48 PM
Ernesto-
Thanks for the comment. The algorithm is correct for selecting N independent random lines, however, I never stated that thus I can see where some may be confused. I’ll update the post with a clarification.
Thanks for the pointer to an algorithm for a “correct” solution. There may be a follow up blog post in the near future.
Comment from Cheri @ Blog This Mom!
Time 2009/03/23 at 10:30 PM
Ah. I see what you mean.