Starting with a Clean Slate

“A generator in Python is a way to create a sequence of values that are generated one at a time instead of all at once”

This can be useful when working with large amounts of data or when you don’t want to store all values in memory at the same time.

Let us work through an example that makes the concept of generators clear. Say you have a large file containing a million lines of text, and you want to print every line that contains a specific word given by the user. One way to do it is this:

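filename = 'test.txt'
word = 'apple'
# Read the whole file into a list of lines (everything is held in memory).
with open(filename) as file:
    contents = file.readlines()
# Print every line that contains the word.
for line in contents:
    if word in line:
        print(line, end='')  # each line already ends with a newline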

The above code runs this way:
  • All the contents of the file are read into contents.
  • Each line of contents is searched for the given word.
  • If the word is found, the line containing it is printed.

The problem with this solution shows up when the file is very large. The code above stores the entire contents of the file in a single variable, so its memory use grows with the size of the file. Where memory is limited, this is very inefficient.
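To get a rough feel for that cost, here is a minimal sketch that loads every line at once and estimates how much memory the result occupies. It uses io.StringIO as a hypothetical in-memory stand-in for a large file, and the name fake_file and the line count are assumptions made for the illustration.

```python
import io
import sys

# Hypothetical in-memory stand-in for a large text file.
fake_file = io.StringIO("".join(f"line {i}\n" for i in range(100_000)))

contents = fake_file.readlines()  # every line is loaded at once

# Size of the list object plus the size of each string it references.
total_bytes = sys.getsizeof(contents) + sum(sys.getsizeof(s) for s in contents)
print(len(contents), "lines held in memory, roughly", total_bytes, "bytes")
```

The exact byte count depends on the Python build, but it grows with the number of lines, which is the point of the example.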

Using a Generator

Here is code that uses a generator to solve the problem.

def find_lines(filename, word):
    with open(filename) as file:
        for line in file:
            if word in line:
                yield line

lines_with_word = find_lines('test.txt', 'apple')
for line in lines_with_word:
    print(line, end='')  # each line already ends with a newline

In this code, each line that contains the word is yielded by the generator using the yield keyword. The file is read line by line, and only the lines that contain apple are yielded.
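One way to see the line-by-line behaviour is to count how many lines have actually been read at each point. The sketch below is not the code above: find_matching and counting are hypothetical helpers that work on a plain list of sample lines instead of a file, so it can run without test.txt.

```python
def find_matching(lines, word):
    """Yield each line from `lines` that contains `word` (hypothetical helper)."""
    for line in lines:
        if word in line:
            yield line

def counting(lines, counter):
    """Pass lines through unchanged while counting how many were read."""
    for line in lines:
        counter["read"] += 1
        yield line

sample = ["apple pie\n", "banana\n", "green apple\n", "cherry\n"]
counter = {"read": 0}

matches = find_matching(counting(sample, counter), "apple")
print(counter["read"])  # 0: nothing has been read yet

for line in matches:
    first = line
    break               # stop after the first match
print(counter["read"])  # 1: only one line was needed

remaining = list(matches)  # continue from where the loop stopped
print(counter["read"])     # 4: now every line has been read
```

Lines are pulled from the source only when the loop asks for the next match, so stopping early means the rest of the input is never touched.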

Why did we use yield instead of return?

If we were to use return instead of yield, the function would no longer be a generator. It would terminate at the first matching line and return only that line, and we would lose the ability to continue producing the subsequent matching lines.
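The difference can be sketched side by side. Both functions below are hypothetical illustrations that search a small list of strings rather than a file.

```python
def first_match_with_return(lines, word):
    for line in lines:
        if word in line:
            return line  # exits the function at the first match

def all_matches_with_yield(lines, word):
    for line in lines:
        if word in line:
            yield line  # pauses here and resumes on the next request

sample = ["apple pie", "banana", "green apple"]
print(first_match_with_return(sample, "apple"))       # apple pie
print(list(all_matches_with_yield(sample, "apple")))  # ['apple pie', 'green apple']
```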

The Holy Grail!

What exactly is yield? 

The yield keyword allows a function to produce a sequence of values, one at a time, without losing its internal state. Any function that contains yield is a generator function.

When a function encounters a yield statement, it temporarily suspends its execution and hands the yielded value back to the caller. Unlike with a regular return statement, the function’s state is saved, allowing it to resume execution from where it left off when the next value is requested.
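A small sketch of what "saved state" means in practice: the hypothetical generator running_total below keeps a local variable, total, that survives from one yield to the next.

```python
def running_total(values):
    total = 0  # local state that is kept alive between yields
    for value in values:
        total += value
        yield total  # pause here; `total` is remembered for the next step

totals = list(running_total([5, 10, 20]))
print(totals)  # [5, 15, 35]
```

If the function started from scratch on every request, each result would simply repeat the input values instead of accumulating them.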

def generate_numbers():
    yield 1
    yield 2
    yield 3
numbers = generate_numbers()

In this example, the generate_numbers() function is a generator function. It uses yield to define three points in the code where the function will yield a value. When generate_numbers() is called, it doesn’t execute the function body immediately. Instead, it returns a generator object.
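That the body does not run at call time can be checked with a short sketch. The events list and the logged_numbers function are hypothetical names introduced only to record when each part of the body executes.

```python
events = []

def logged_numbers():
    events.append("body started")
    yield 1
    events.append("resumed after first yield")
    yield 2

gen = logged_numbers()
print(type(gen).__name__)  # generator
print(events)              # []: the body has not started yet

values = list(gen)         # iterating is what actually runs the body
print(values)              # [1, 2]
print(events)              # ['body started', 'resumed after first yield']
```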

To retrieve the values generated by the generator, we can use a for loop or the next() function:

print(next(numbers))  # Output: 1
print(next(numbers))  # Output: 2
print(next(numbers))  # Output: 3

Here, every time execution reaches a yield statement, whether during iteration or through a call to next(), the generator function pauses, yields the specified value, and saves its state. That state records where in the function body execution stopped, along with the values of any local variables. When the next value is requested, the generator resumes from where it left off, continuing the loop or the subsequent statements.
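A generator can be consumed only once. The sketch below repeats generate_numbers() so that it runs on its own, then shows what happens after the last value: a for loop simply ends, while a bare next() raises StopIteration unless a default is supplied. The default string "no more values" is an arbitrary choice for the illustration.

```python
def generate_numbers():
    yield 1
    yield 2
    yield 3

numbers = generate_numbers()
collected = []
for n in numbers:  # the for loop stops on its own after the last value
    collected.append(n)
print(collected)   # [1, 2, 3]

# The generator is now exhausted: a bare next() raises StopIteration.
try:
    next(numbers)
    exhausted = False
except StopIteration:
    exhausted = True
print(exhausted)   # True

# Passing a default to next() avoids the exception.
fallback = next(numbers, "no more values")
print(fallback)    # no more values
```

To go through the values again, call generate_numbers() once more to get a fresh generator object.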

Not using a generator vs. using a generator

Point to be noted

  • Instead of storing and processing all the values (the lines in our earlier example) at once, a generator holds and processes them individually, one at a time. Once a yielded value has been consumed, it no longer needs to stay in memory.
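As a rough illustration of this point, the sketch below compares a list that holds every value with a generator that produces the same values on demand. The squares function and the count are hypothetical, and sys.getsizeof measures only the container object itself, so the figures are indicative rather than exact.

```python
import sys

def squares(count):
    """Yield the squares of 0..count-1 one at a time (hypothetical example)."""
    for n in range(count):
        yield n * n

count = 100_000

squares_list = [n * n for n in range(count)]  # all values stored at once
squares_gen = squares(count)                  # values produced on demand

list_bytes = sys.getsizeof(squares_list)  # grows with count
gen_bytes = sys.getsizeof(squares_gen)    # small and independent of count
print(list_bytes > gen_bytes)             # True

# The generator can still feed an aggregate without building a list.
total = sum(squares_gen)
print(total == sum(squares_list))         # True
```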

Summary:

Generators are a powerful feature in Python for working with large datasets or files. They provide memory-efficient, on-demand generation of values, which makes them well suited to tasks such as data processing and efficient iteration.
