Bryce Boe

The Adventures of a UCSB Computer Science Ph.D. Student

Skip to: Content | Sidebar | Footer

Random Lines from a File

23 March, 2009 (01:36) | General | By: Bryce Boe

The following is my python implementation of choosing N random lines with equal probability from a file. This implementation is both memory and time efficient unlike other solutions. This problem was originally brought to my attention after a friend was asked a similar problem at a Google interview. Additionally a few weeks ago I was catching up on Jonathan’s blog and came across his March 2008 post about this problem. I submitted a similar solution to the one below in his comments.

Anyway, a naïve solution to this problem involves calculating a random position in the file, with the file size being the max. However this solution does not give equal probability to each line as longer lines will more likely be selected.

The simple solution involves counting the number of lines in the file and then choosing N random lines. However this requires a scan over the entire file to count the number of lines, and then in the worst case an entire scan over the file to print each random line. This is time inefficient if the file is massive in size.

One optimization is to store the seek position of all the line indexes in memory, thus avoiding the scanning to print out each random line. However this requires storing a number for all the lines in the file, which is memory inefficient if the file has a ton of lines.

By choosing whether to keep or replace the selected lines at each step in a single scan, one simply needs to store N positions into the file thus being both time and memory efficient. I’ll leave the exercise of proving that each line is selected with equal probability up to you.

Edited: Added prev so that the first line would be included, and replaced file.readline()[:-1] with file.readline().strip() to remove trailing whitespace properly.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
#!/usr/bin/env python
import os, random, sys
 
if __name__ == '__main__':
    def error_exit(msg):
        sys.stderr.write(msg)
        sys.exit(1)
 
    # Verify arguments
    if len(sys.argv) != 3: error_exit('Usage: %s FILE NUM_LINES\n' %
                                      os.path.basename(sys.argv[0]))
    text_file = sys.argv[1]
    if not os.path.isfile(text_file):
        error_exit('%s does not exist, or is not a file\n' % text_file)
    try:
        num_lines = int(sys.argv[2])
    except ValueError:
        error_exit('%s is not a number\n' % sys.argv[2])
 
    seeks = [0 for x in range(num_lines)]
    file = open(text_file)
    count = 0
    prev = 0
 
    # Calculate Random Lines
    while file.readline():
        for i in range(num_lines):
            if random.randint(0, count) == 0:
                seeks[i] = prev
        prev = file.tell()
        count += 1
 
    # Print Random Lines
    for i, pos in enumerate(seeks):
        file.seek(pos)
        print '%d: %s' % (i, file.readline().strip())
 
    file.close()

Related Entries

Comments

Comment from Cheri @ Blog This Mom!
Time March 23, 2009 at 10:30 pm

Ah. I see what you mean.

Comment from Cheri @ Blog This Mom!
Time March 23, 2009 at 10:31 pm

Heh.

I totally lied.

I have no effing idea what you mean.

But I’m sure you’re right. ;-)

Comment from Bryce Boe
Time April 1, 2009 at 12:10 pm

:-D Cheri, I love your comments.

Write a comment