Advanced Python#

Iterators#


👉 Iterators are everywhere in Python. They are used implicitly by for loops, comprehensions, generators and so on, but are hidden in plain sight.

👉 An iterator in Python is simply an object that can be iterated upon: an object that returns data one element at a time.

👉 A Python iterator object must implement two special methods, __iter__() and __next__(), collectively called the iterator protocol.

👉 An object is called iterable if we can get an iterator from it. Most built-in containers in Python, such as strings, lists and tuples, are iterables.

👉 The iter() function (which in turn calls the __iter__() method) returns an iterator from an iterable.

Iterating the iterator#

👉 We use the next() function to manually step through the items of an iterator.

👉 When we reach the end and there is no more data to return, next() raises the StopIteration exception.

# Example 1:

# define a list
my_list = [6, 9, 0, 3]  # 4 elements

# get an iterator using iter()
my_iter = iter(my_list)

# iterate through it using next()

print(next(my_iter))       # Output: 6
print(next(my_iter))       # Output: 9

# next(obj) is same as obj.__next__()

print(my_iter.__next__())  # Output: 0
print(my_iter.__next__())  # Output: 3

# This will raise error, no items left
next(my_iter)
6
9
0
3
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
Cell In[1], line 20
     17 print(my_iter.__next__())  # Output: 3
     19 # This will raise error, no items left
---> 20 next(my_iter)

StopIteration: 

A more elegant way of doing the same thing is to use a for loop.

for element in my_list:  # the for loop calls iter() and next() for us
    print(element)
6
9
0
3

Why was there no StopIteration exception?

Let's take a closer look at how the for loop is actually implemented in Python.

>>> for element in iterable:
>>>     # do something with element

is roughly equivalent to:

>>> # create an iterator object from that iterable
>>> iter_obj = iter(iterable)

>>> # infinite loop
>>> while True:
>>>     try:
>>>         # get the next item
>>>         element = next(iter_obj)
>>>         # do something with element
>>>     except StopIteration:
>>>         # if StopIteration is raised, break from loop
>>>         break

So internally, the for loop creates an iterator object, iter_obj by calling iter() on the iterable.

Inside the loop, it calls next() to get the next element and executes the body of the for loop with this value.

After all the items are exhausted, StopIteration is raised, which is caught internally, and the loop ends.

Custom Iterators#

Building an iterator from scratch is easy in Python. We just have to implement the __iter__() and the __next__() methods.

👉 The __iter__() method returns the iterator object itself. If required, some initialization can be performed.

👉 The __next__() method must return the next item in the sequence. On reaching the end, and in subsequent calls, it must raise StopIteration.

class PowTwo:
    """Class to implement an iterator
    of powers of two"""

    def __init__(self, max=0):
        self.max = max

    def __iter__(self):
        self.n = 0
        return self

    def __next__(self):
        if self.n <= self.max:
            result = 2 ** self.n
            self.n += 1
            return result
        else:
            raise StopIteration


# create an object
numbers = PowTwo(4)

# create an iterator from the object
i = iter(numbers)

# Using next to get to the next iterator element
print(next(i))
print(next(i))
print(next(i))
print(next(i))
print(next(i))
1
2
4
8
16

Generators#


👉 Generators are another way of creating iterators in Python. All the work involved in creating an iterator (implementing __iter__() and __next__(), and raising StopIteration) is handled automatically by generators.

👉 A generator is a function that returns an object (an iterator) which we can iterate over, one value at a time.

It is fairly simple to create a generator in Python. It is as easy as defining a normal function, but with a yield statement instead of a return statement.

How a generator function differs from a normal function:

  1. A generator function contains one or more yield statements.

  2. When called, it returns an object (an iterator) but does not start execution immediately.

  3. Methods like __iter__() and __next__() are implemented automatically, so we can iterate through the items using next().

  4. Once the function yields, it is paused and control is transferred to the caller.

  5. Local variables and their states are remembered between successive calls.

  6. Finally, when the function terminates, StopIteration is raised automatically on further calls.

# A simple generator function

def my_gen():
    n = 1
    print('This is printed first')
    # Generator function contains yield statements
    yield n

    n += 1
    print('This is printed second')
    yield n

    n += 1
    print('This is printed at last')
    yield n


# Using for loop
for item in my_gen():
    print(item)
This is printed first
1
This is printed second
2
This is printed at last
3

👉 One interesting thing to note in the above example is that the value of the variable n is remembered between calls.

👉 Unlike normal functions, the local variables are not destroyed when the function yields.

👉 Furthermore, a generator object can be iterated over only once.
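
The single-use behaviour can be seen in a small sketch. The generator function count_to_three() is a made-up name for this illustration and is not part of the examples above:

```python
def count_to_three():
    # A generator function that yields three values
    yield 1
    yield 2
    yield 3


gen = count_to_three()
first_pass = list(gen)    # consumes every value
second_pass = list(gen)   # the generator is already exhausted

print(first_pass)    # [1, 2, 3]
print(second_pass)   # []

# Calling the generator function again returns a fresh generator object
fresh_pass = list(count_to_three())
print(fresh_pass)    # [1, 2, 3]
```

To iterate again, call the generator function again rather than reusing the old generator object.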

Normally, generator functions are implemented using a loop with a suitable terminating condition.

def rev_str(my_str):
    length = len(my_str)
    for i in range(length - 1, -1, -1):
        yield my_str[i]


# For loop to reverse the string
for char in rev_str("hello"):
    print(char)
o
l
l
e
h

Generator Expressions#

👉 Simple generators can be easily created on the fly using generator expressions.

👉 Just as lambda functions create anonymous functions, generator expressions create anonymous generator functions.

👉 The syntax of a generator expression is similar to that of a list comprehension in Python, but the square brackets are replaced with round parentheses.

👉 The major difference between a list comprehension and a generator expression is that a list comprehension produces the entire list, while the generator expression produces one item at a time.

👉 Generator expressions are evaluated lazily (producing items only when asked for). For this reason, a generator expression is much more memory efficient than an equivalent list comprehension.

# Initialize the list
my_list = [1, 3, 6, 10]

# square each term using list comprehension
list_ = [x**2 for x in my_list]

# same thing can be done using a generator expression
# generator expressions are surrounded by parenthesis ()
generator = (x**2 for x in my_list)

print(list_)
print(generator)
[1, 9, 36, 100]
<generator object <genexpr> at 0x1045db100>

We can see above that the generator expression did not produce the required result immediately. Instead, it returned a generator object, which produces items only on demand.

# Initialize the list
my_list = [1, 3, 6, 10]

a = (x**2 for x in my_list)
print(next(a))

print(next(a))

print(next(a))

print(next(a))

next(a)
1
9
36
100
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
Cell In[7], line 13
      9 print(next(a))
     11 print(next(a))
---> 13 next(a)

StopIteration: 

Generator advantages#

👉 Generator functions are easier to implement than iterators

  • There is no need to implement the __iter__() and __next__() methods as in iterator classes

  • Only a yield statement is needed

👉 Generators are more memory efficient than building the full sequence

  • A normal function that returns a sequence creates the entire sequence in memory before returning the result. This is overkill if the number of items in the sequence is very large.

  • A generator implementation of such a sequence is memory friendly and is preferred, since it produces only one item at a time.

👉 Generators can represent an infinite stream

  • Generators are an excellent medium for representing an infinite stream of data.

  • Infinite streams cannot be stored in memory, and since generators produce only one item at a time, they can represent an infinite stream of data.

👉 Generators can be pipelined

  • Multiple generators can be used to pipeline a series of operations. This is best illustrated using an example.

Suppose we have a generator that produces the numbers in the Fibonacci series. And we have another generator for squaring numbers.

If we want to find out the sum of squares of numbers in the Fibonacci series, we can do it in the following way by pipelining the output of generator functions together.

def fibonacci_numbers(nums):
    x, y = 0, 1
    for _ in range(nums):
        x, y = y, x+y
        yield x

def square(nums):
    for num in nums:
        yield num**2

print(sum(square(fibonacci_numbers(10))))
4895
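
The infinite-stream point can be sketched in the same way. The generator all_even() below is a hypothetical example, and itertools.islice() from the standard library is used to take a finite slice of it:

```python
from itertools import islice


def all_even():
    # Yields 0, 2, 4, ... and never terminates on its own
    n = 0
    while True:
        yield n
        n += 2


# islice() pulls only the first five items, so the infinite loop
# inside all_even() is never run to completion
first_five = list(islice(all_even(), 5))
print(first_five)   # [0, 2, 4, 6, 8]
```

Calling list(all_even()) directly would never finish, which is why the consumer has to decide how many items to take.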

Closures#


Before looking at closures in Python, let us first understand higher-order functions.

Higher order functions#

In Python, functions are treated as first-class citizens, allowing you to perform the following operations on functions:

  • A function can take one or more functions as parameters

  • A function can be returned as a result of another function

  • A function can be modified

  • A function can be assigned to a variable

In this sub-section, we will cover:

  1. Handling functions as parameters

  2. Returning functions as return values from other functions

Function as a parameter#

def sum_numbers(nums):  # normal function
    return sum(nums)    # a sad function abusing the built-in sum function :<

def higher_order_function(f, lst):  # function as a parameter
    summation = f(lst)
    return summation
result = higher_order_function(sum_numbers, [1, 2, 3, 4, 5])
print(result)       # 15

Function as a return value#

def square(x):          # a square function
    return x ** 2

def cube(x):            # a cube function
    return x ** 3

def absolute(x):        # an absolute value function
    if x >= 0:
        return x
    else:
        return -(x)

def higher_order_function(type): # a higher order function returning a function
    if type == 'square':
        return square
    elif type == 'cube':
        return cube
    elif type == 'absolute':
        return absolute

result = higher_order_function('square')
print(result(3))       # 9
result = higher_order_function('cube')
print(result(3))       # 27
result = higher_order_function('absolute')
print(result(-3))      # 3
9
27
3

You can see from the above example that the higher-order function returns a different function depending on the parameter passed.

Non local variable in a nested function#

Before getting into what a closure is, we first have to understand what nested functions and non-local variables are.

👉 A function defined inside another function is called a nested function. Nested functions can access variables of the enclosing scope.

👉 In Python, a nested function can read these non-local variables by default, but to rebind them it must declare them explicitly with the nonlocal keyword.

Following is an example of a nested function accessing a non-local variable.

def print_msg(msg):
    # This is the outer enclosing function

    def printer():
        # This is the nested function
        print(msg)

    printer()

# We execute the function
# Output: Hello
print_msg("Hello")
Hello

We can see that the nested printer() function was able to access the non-local msg variable of the enclosing function.
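
Reading a non-local variable needs no declaration, but rebinding one does. A minimal sketch of the nonlocal keyword follows; count_items() and increment() are names invented for this illustration:

```python
def count_items(items):
    count = 0  # variable in the enclosing scope

    def increment():
        # Without this declaration, "count += 1" would raise
        # UnboundLocalError, because count would be treated as local
        nonlocal count
        count += 1

    for _ in items:
        increment()
    return count


total = count_items(["a", "b", "c"])
print(total)   # 3
```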

Defining Closures#

👉 Python allows a nested function to access variables in the scope of its enclosing function. When the nested function keeps that access after the enclosing function has finished, it is known as a closure.

👉 A closure is created by nesting a function inside another enclosing function and then returning the inner function.

In the example above, what would happen if the last line of the function print_msg() returned the printer() function instead of calling it? This means the function was defined as follows:

def print_msg(msg):
    # This is the outer enclosing function

    def printer():
        # This is the nested function
        print(msg)

    return printer  # returns the nested function


# Now let's try calling this function.
# Output: Hello
another = print_msg("Hello")
another()
Hello

👉 The print_msg() function was called with the string "Hello".

👉 The returned function was bound to the name another.

👉 On calling another(), the message was still remembered although we had already finished executing the print_msg() function.

This technique, by which some data ("Hello" in this case) gets attached to the code, is called a closure in Python.

💡 The value in the enclosing scope is remembered even when the variable goes out of scope or the function itself is removed from the current namespace.

Let's delete the original function:

del print_msg
another()
Hello

Here, the returned function still works even when the original function was deleted.

As seen from the above example, we have a closure in Python when a nested function references a value in its enclosing scope.

The criteria that must be met to create closure in Python are summarized in the following points.

  1. We must have a nested function (function inside a function).

  2. The nested function must refer to a value defined in the enclosing function.

  3. The enclosing function must return the nested function.

When to use closures?#

So what are closures good for?

👉 Closures can avoid the use of global values and provide some form of data hiding. They can also be an alternative to an object-oriented solution.

👉 When a class would have only a few methods (one method in most cases), closures can provide an alternative and more elegant solution. But when the number of attributes and methods grows, it is better to implement a class.
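
As a rough comparison of the two approaches, here is a sketch of a closure next to an equivalent one-method class. Both make_multiplier_of() and MultiplierOf are hypothetical names chosen for this illustration:

```python
def make_multiplier_of(n):
    # Closure version: n is remembered by the nested function
    def multiplier(x):
        return x * n
    return multiplier


class MultiplierOf:
    # Class version: n is stored as an attribute instead
    def __init__(self, n):
        self.n = n

    def __call__(self, x):
        return x * self.n


times3 = make_multiplier_of(3)
times5 = make_multiplier_of(5)

print(times3(9))            # 27
print(times5(3))            # 15
print(MultiplierOf(3)(9))   # 27
```

Each call to make_multiplier_of() creates an independent closure with its own value of n, and no global variable is involved.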

Decorators#


A decorator takes in a function, adds some functionality and returns it.

👉 Use decorators to add functionality to existing code.

👉 A decorator is a design pattern in Python that allows a user to add new functionality to an existing object without modifying its structure.

👉 Decorators are usually applied just before the definition of the function you want to decorate.

This is also called metaprogramming, because one part of the program modifies another part of the program at the time the decorated function is defined.

Prerequisites#

In order to understand decorators, we must first know a few basic things about Python.

👉 We must be comfortable with the fact that everything in Python (yes, even classes) is an object.

👉 Names that we define are simply identifiers bound to these objects.

👉 Functions are no exception; they are objects too (with attributes). Several different names can be bound to the same function object.

Recap#

Let's understand what happens when a function is assigned to a variable:

def first(msg):
    print(msg)


first("Hello")

second = first
second("Hello")

When you run the code, both functions first and second give the same output.

Here, the names first and second refer to the same function object.

👉 As we learned earlier, functions can be passed as arguments to other functions. Functions that take other functions as arguments are called higher-order functions.

>>> def inc(x):
>>>     return x + 1
>>>
>>> def dec(x):
>>>     return x - 1
>>>
>>> def operate(func, x):
>>>     result = func(x)
>>>     return result

We invoke the function as follows:

>>> operate(inc, 3)
4
>>> operate(dec, 3)
2

👉 A function can return another function as well.

def is_called():  # created 1st function
    def is_returned():  # Created 2nd function (nested)
        print("Hello")
    return is_returned


new = is_called()

# Outputs "Hello"
new()

👉 And finally, closures allow nested functions to access the outer scope.

# Normal function
def greeting():
    return 'Welcome to Python'
def uppercase_decorator(function):
    def wrapper():
        func = function()
        make_uppercase = func.upper()
        return make_uppercase
    return wrapper
g = uppercase_decorator(greeting)
print(g())
WELCOME TO PYTHON

You can implement above example with a decorator

'''This decorator function is a higher order function
that takes a function as a parameter'''
def uppercase_decorator(function):
    def wrapper():
        func = function()
        make_uppercase = func.upper()
        return make_uppercase
    return wrapper


@uppercase_decorator
def greeting():
    return 'Welcome to Python'
print(greeting())   # WELCOME TO PYTHON
WELCOME TO PYTHON

Decorators#

👉 Functions and methods are called callable because they can be called.

👉 In fact, any object which implements the special __call__() method is termed callable. So, in the most basic sense, a decorator is a callable that returns a callable.

Basically, a decorator takes in a function, adds some functionality and returns it.

def make_pretty(func):
    def inner():
        print("I got decorated")
        func()
    return inner


def ordinary():
    print("I am ordinary")

Calling ordinary() directly is a plain function call:

ordinary()
I am ordinary

But if you decorate ordinary with make_pretty, a wrapper is added around the original function:

@make_pretty
def ordinary():
    print("I am ordinary")

ordinary()
I got decorated
I am ordinary

Generally, we decorate a function and reassign it as:

>>> ordinary = make_pretty(ordinary)

This is a common construct, and for this reason Python has a syntax to simplify it.

We can use the @ symbol along with the name of the decorator function and place it above the definition of the function to be decorated.

Decorators with Parameters#

The above decorator was simple and it only worked with functions that did not have any parameters. What if we had functions that took in parameters like:

def divide(a, b):
    return a/b
def smart_divide(func):
    def inner(a, b):
        print("I am going to divide", a, "and", b)
        if b == 0:
            print("Whoops! cannot divide with 0")
            return

        return func(a, b)
    return inner


@smart_divide
def divide(a, b):
    print(a/b)
divide(2,5)
I am going to divide 2 and 5
0.4

👉 The parameters of the nested inner() function inside the decorator are the same as the parameters of the function it decorates. Taking this into account, we can now make general decorators that work with any number of parameters.

>>> def works_for_all(func):
>>>     def inner(*args, **kwargs):
>>>         print("I can decorate any function")
>>>         return func(*args, **kwargs)
>>>     return inner
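
A short usage sketch of such a general decorator follows. The decorated functions add() and greet() are made up for this illustration; they have different signatures but share the same decorator:

```python
def works_for_all(func):
    def inner(*args, **kwargs):
        print("I can decorate any function")
        # Forward whatever positional and keyword arguments were given
        return func(*args, **kwargs)
    return inner


@works_for_all
def add(a, b):
    return a + b


@works_for_all
def greet(name, greeting="Hello"):
    return f"{greeting}, {name}"


print(add(2, 3))                     # 5
print(greet("Ada", greeting="Hi"))   # Hi, Ada
```

Because inner() returns the result of func(*args, **kwargs), the return value of the decorated function is preserved.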

Decorators chaining#

A function can be decorated multiple times with different (or same) decorators. We simply place the decorators above the desired function.

def star(func):
    def inner(*args, **kwargs):
        print("*" * 30)
        func(*args, **kwargs)
        print("*" * 30)
    return inner


def percent(func):
    def inner(*args, **kwargs):
        print("%" * 30)
        func(*args, **kwargs)
        print("%" * 30)
    return inner


@star
@percent
def printer(msg):
    print(msg)


printer("Hello")
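
The order in which chained decorators run can be checked with a sketch like the one below. It is a variation of the example above that records each step in a list (the calls list is an addition for this illustration) instead of printing rows of symbols:

```python
calls = []


def star(func):
    def inner(*args, **kwargs):
        calls.append("star: before")
        func(*args, **kwargs)
        calls.append("star: after")
    return inner


def percent(func):
    def inner(*args, **kwargs):
        calls.append("percent: before")
        func(*args, **kwargs)
        calls.append("percent: after")
    return inner


@star
@percent
def printer(msg):
    calls.append(msg)


printer("Hello")
print(calls)
# ['star: before', 'percent: before', 'Hello', 'percent: after', 'star: after']
```

The decorator closest to the function is applied first, so the stacked form is equivalent to printer = star(percent(printer)), and the star wrapper is the outermost one.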

Built in higher order functions#

map function#

The map() function is a built-in function that takes a function and an iterable as parameters. It returns an iterator that applies the function to each item of the iterable.

    # syntax
    map(function, iterable)
# Example 1: 

numbers = [1, 2, 3, 4, 5] # iterable
def square(x):
    return x ** 2
numbers_squared = map(square, numbers)
print(list(numbers_squared))    # [1, 4, 9, 16, 25]
# Lets apply it with a lambda function
numbers_squared = map(lambda x : x ** 2, numbers)
print(list(numbers_squared))    # [1, 4, 9, 16, 25]
[1, 4, 9, 16, 25]
[1, 4, 9, 16, 25]

filter function#

The filter() function calls the specified function, which should return a boolean, for each item of the specified iterable. It returns an iterator over the items for which the function returns True.

    # syntax
    filter(function, iterable)
# Example 1: 

numbers = [1, 2, 3, 4, 5]  # iterable

def is_even(num):
    if num % 2 == 0:
        return True
    return False

even_numbers = filter(is_even, numbers)
print(list(even_numbers))       # [2, 4]
[2, 4]

@property Decorator#


Python provides a built-in @property decorator which makes the use of getters and setters much easier in object-oriented programming.

Before going into details on what the @property decorator is, let us first build an intuition for why it is needed in the first place.

Let's understand this with a simple example:

class Celsius:
    def __init__(self, temperature = 0):
        self.temperature = temperature

    def to_fahrenheit(self):
        return (self.temperature * 1.8) + 32

We can make objects out of this class and manipulate the temperature attribute as we wish:

# Create a new object
human = Celsius()

# Set the temperature
human.temperature = 37

# Get the temperature attribute
print(human.temperature)

# Get the to_fahrenheit method
print(human.to_fahrenheit())
37
98.60000000000001
💡 Whenever we assign or retrieve an object attribute like temperature as shown above, Python looks it up in the object's built-in __dict__ dictionary attribute.

Therefore, human.temperature internally becomes human.__dict__['temperature'].

human.__dict__
{'temperature': 37}

Now, let's go one step further and implement getter and setter methods, so that we can add our own validation logic whenever the temperature is set.

# Making Getters and Setter methods
class Celsius:
    def __init__(self, temperature=0):
        self.set_temperature(temperature)

    def to_fahrenheit(self):
        return (self.get_temperature() * 1.8) + 32

    # getter method
    def get_temperature(self):
        return self._temperature

    # setter method
    def set_temperature(self, value):
        if value < -273.15:
            raise ValueError("Temperature below -273.15 is not possible.")
        self._temperature = value

As we can see, the above class introduces two new methods, get_temperature() and set_temperature().

Furthermore, temperature was replaced with _temperature. By convention, an underscore _ at the beginning of a name marks a variable as private in Python.

Now let's use this implementation:

# Create a new object, set_temperature() internally called by __init__
human = Celsius(37)

# Get the temperature attribute via a getter
print(human.get_temperature())

# Get the to_fahrenheit method, get_temperature() called by the method itself
print(human.to_fahrenheit())

# new constraint implementation
human.set_temperature(-300)

# Get the to_fahrenheit method
print(human.to_fahrenheit())
37
98.60000000000001
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[9], line 11
      8 print(human.to_fahrenheit())
     10 # new constraint implementation
---> 11 human.set_temperature(-300)
     13 # Get the to_fahrenheit method
     14 print(human.to_fahrenheit())

Cell In[8], line 16, in Celsius.set_temperature(self, value)
     14 def set_temperature(self, value):
     15     if value < -273.15:
---> 16         raise ValueError("Temperature below -273.15 is not possible.")
     17     self._temperature = value

ValueError: Temperature below -273.15 is not possible.

This update successfully implemented the new restriction. We are no longer allowed to set the temperature below -273.15 degrees Celsius.

Note: Private variables don't actually exist in Python. There are simply conventions to be followed; the language itself doesn't apply any restrictions.

>>> human._temperature = -300
>>> human.get_temperature()
-300

However, the bigger problem with the above update is that all the programs that used our previous class have to change their code from obj.temperature to obj.get_temperature(), and all expressions like obj.temperature = val to obj.set_temperature(val).

This refactoring can cause problems when dealing with hundreds of thousands of lines of code.

All in all, our new update was not backwards compatible. This is where @property comes to the rescue.

Property class#

We can use the @property decorator on the methods that we want to expose as a property of the class.

# Using @property decorator

class Celsius:
    def __init__(self, temperature=0):
        self.temperature = temperature

    def to_fahrenheit(self):
        return (self.temperature * 1.8) + 32

    @property
    def temperature(self):
        print("Getting value...")
        return self._temperature

    @temperature.setter
    def temperature(self, value):
        print("Setting value...")
        if value < -273.15:
            raise ValueError("Temperature below -273.15 is not possible")
        self._temperature = value


# create an object
human = Celsius(37)

print(human.temperature)

print(human.to_fahrenheit())

coldest_thing = Celsius(-300)

Which is equivalent to the following code:

# using property class
class Celsius:
    def __init__(self, temperature=0):
        self.temperature = temperature

    def to_fahrenheit(self):
        return (self.temperature * 1.8) + 32

    # getter
    def get_temperature(self):
        print("Getting value...")
        return self._temperature

    # setter
    def set_temperature(self, value):
        print("Setting value...")
        if value < -273.15:
            raise ValueError("Temperature below -273.15 is not possible")
        self._temperature = value

    # creating a property object
    temperature = property(get_temperature, set_temperature)


# create an object
human = Celsius(37)

print(human.temperature)

print(human.to_fahrenheit())

coldest_thing = Celsius(-300)
Setting value...
Getting value...
37
Getting value...
98.60000000000001
Setting value...
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[10], line 32
     28 print(human.temperature)
     30 print(human.to_fahrenheit())
---> 32 coldest_thing = Celsius(-300)

Cell In[10], line 4, in Celsius.__init__(self, temperature)
      3 def __init__(self, temperature=0):
----> 4     self.temperature = temperature

Cell In[10], line 18, in Celsius.set_temperature(self, value)
     16 print("Setting value...")
     17 if value < -273.15:
---> 18     raise ValueError("Temperature below -273.15 is not possible")
     19 self._temperature = value

ValueError: Temperature below -273.15 is not possible

By using property, we can see that no modification is required in the implementation of the value constraint. Thus, our implementation is backward compatible.

Note: The actual temperature value is stored in the private _temperature variable. The temperature attribute is a property object which provides an interface to this private variable.

In Python, property() is a built-in function that creates and returns a property object. The syntax of this function is:

property(fget=None, fset=None, fdel=None, doc=None)

where,

  • fget is a function to get the value of the attribute

  • fset is a function to set the value of the attribute

  • fdel is a function to delete the attribute

  • doc is a string (like a comment)

As seen from the implementation, these function arguments are optional.
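
Because the arguments are optional, passing only fget gives a read-only attribute. The sketch below illustrates this with a hypothetical Circle class that is not part of the Celsius example:

```python
class Circle:
    def __init__(self, radius):
        self._radius = radius

    def get_radius(self):
        return self._radius

    # Only fget (and doc) are given, so there is no setter or deleter
    radius = property(get_radius, doc="Radius of the circle")


c = Circle(2)
print(c.radius)   # 2

try:
    c.radius = 5
    assignment_failed = False
except AttributeError:
    # Assigning to a property without fset raises AttributeError
    assignment_failed = True

print(assignment_failed)   # True
```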

RegEx#


A Regular Expression (RegEx) is a sequence of characters that defines a search pattern. For example,

>>> ^a...s$

The above code defines a RegEx pattern. The pattern is: any five letter string starting with a and ending with s.

A pattern defined using RegEx can be used to match against a string.

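| Expression | String | Matched? |
| --- | --- | --- |
| ^a...s$ | abs | No match |
| ^a...s$ | alias | Match |
| ^a...s$ | abyss | Match |
| ^a...s$ | Alias | No match |
| ^a...s$ | An abacus | No match |

Python has a module named re to work with RegEx. Here's an example: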

import re

pattern = '^a...s$'
test_string = 'abyss'
result = re.match(pattern, test_string)

if result:
    print("Search successful.")
else:
    print("Search unsuccessful!")	
Search successful.

Here, we used the re.match() function to look for the pattern at the start of test_string.

The function returns a match object if the search is successful. If not, it returns None.
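
A brief sketch of what can be done with the returned match object, and of how re.match() differs from re.search(), is shown below. The pattern 'bac' and the variable names are chosen only for this illustration:

```python
import re

result = re.match(r'^a...s$', 'abyss')
matched_text = result.group()   # the text that matched
matched_span = result.span()    # (start, end) positions of the match
print(matched_text)   # abyss
print(matched_span)   # (0, 5)

# re.match() only tries to match at the start of the string ...
at_start = re.match(r'bac', 'An abacus')
print(at_start)       # None

# ... while re.search() scans the whole string for the first match
anywhere = re.search(r'bac', 'An abacus')
print(anywhere.group())   # bac
```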

Patterns#

MetaCharacters#

Metacharacters are characters that are interpreted in a special way by a RegEx engine. Here’s a list of metacharacters:

**`[] . ^ $ * + ? {} () \ |`**
[] - Square brackets#

Square brackets specify a set of characters you wish to match.

| Expression | String | Matched? |
| --- | --- | --- |
| [abc] | a | 1 match |
| [abc] | ac | 2 matches |
| [abc] | Hey Jude | No match |
| [abc] | abc de ca | 5 matches |

Here, [abc] will match if the string you are trying to match contains any of a, b or c.

You can also specify a range of characters using - inside square brackets.

  • [a-c] means a or b or c

  • [a-z] means any lowercase letter from a to z

  • [A-Z] means any uppercase letter from A to Z

  • [0-3] means 0 or 1 or 2 or 3

  • [0-9] means any digit from 0 to 9

  • [A-Za-z0-9] means any single character that is a to z, A to Z or 0 to 9

You can complement (invert) the character set by using the caret ^ symbol at the start of a square bracket.

  • [^abc] means any character except a or b or c.

  • [^0-9] means any non-digit character.
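
The match counts for character sets can be reproduced with re.findall(), which returns every non-overlapping match as a list. This is a small sketch; the test strings for the range and the inverted set are made up for this illustration:

```python
import re

set_matches = re.findall(r'[abc]', 'abc de ca')
no_matches = re.findall(r'[abc]', 'Hey Jude')
range_matches = re.findall(r'[0-3]', '0123456')
inverted_matches = re.findall(r'[^0-9]', 'a1b2')

print(set_matches)        # ['a', 'b', 'c', 'c', 'a']
print(no_matches)         # []
print(range_matches)      # ['0', '1', '2', '3']
print(inverted_matches)   # ['a', 'b']
```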

. - Period#

A period matches any single character (except newline '\n').

| Expression | String | Matched? |
| --- | --- | --- |
| .. | a | No match |
| .. | ac | 1 match |
| .. | acd | 1 match |
| .. | acde | 2 matches (contains 4 characters) |

^ - Caret#

The caret symbol ^ is used to check if a string starts with a certain character.

  • r'^substring', e.g. r'^love', matches a sentence that starts with the word love

  • r'[^abc]' means not a, not b and not c (inside square brackets, ^ inverts the set)

| Expression | String | Matched? |
| --- | --- | --- |
| ^a | a | 1 match |
| ^a | abc | 1 match |
| ^a | bac | No match |
| ^ab | abc | 1 match |
| ^ab | acd | No match (starts with a but not followed by b) |

$ - Dollar#

The dollar symbol $ is used to check if a string ends with a certain character.

| Expression | String | Matched? |
| --- | --- | --- |
| a$ | a | 1 match |
| a$ | formula | 1 match |
| a$ | cab | No match |

* - Star#

The star symbol * matches zero or more occurrences of the pattern to its left.

| Expression | String | Matched? |
| --- | --- | --- |
| ma*n | mn | 1 match |
| ma*n | man | 1 match |
| ma*n | maaan | 1 match |
| ma*n | main | No match (a is not followed by n) |
| ma*n | woman | 1 match |

+ - Plus#

The plus symbol + matches one or more occurrences of the pattern to its left.

| Expression | String | Matched? |
| --- | --- | --- |
| ma+n | mn | No match (no a character) |
| ma+n | man | 1 match |
| ma+n | maaan | 1 match |
| ma+n | main | No match (a is not followed by n) |
| ma+n | woman | 1 match |

? - Question Mark#

The question mark symbol ? matches zero or one occurrence of the pattern to its left.

| Expression | String | Matched? |
| --- | --- | --- |
| ma?n | mn | 1 match |
| ma?n | man | 1 match |
| ma?n | maaan | No match (more than one a character) |
| ma?n | main | No match (a is not followed by n) |
| ma?n | woman | 1 match |
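
The three quantifiers *, + and ? can be compared side by side with a short sketch. It runs the patterns ma*n, ma+n and ma?n against the same test words; the results dictionary is just a helper for this illustration:

```python
import re

words = ['mn', 'man', 'maaan', 'main', 'woman']

results = {}
for pattern in (r'ma*n', r'ma+n', r'ma?n'):
    # Keep the words in which the pattern is found somewhere
    results[pattern] = [w for w in words if re.search(pattern, w)]
    print(pattern, results[pattern])

# ma*n ['mn', 'man', 'maaan', 'woman']
# ma+n ['man', 'maaan', 'woman']
# ma?n ['mn', 'man', 'woman']
```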

{} - Braces#

We can specify how many times a pattern should repeat by using curly brackets {}.

Consider this code: {n,m}. This means at least n, and at most m, repetitions of the pattern to its left.

  • {3}: exactly 3 repetitions

  • {3,}: at least 3 repetitions

  • {3,8}: 3 to 8 repetitions

| Expression | String | Matched? |
| --- | --- | --- |
| a{2,3} | abc dat | No match |
| a{2,3} | abc daat | 1 match (the aa in daat) |
| a{2,3} | aabc daaat | 2 matches (the aa in aabc and the aaa in daaat) |
| a{2,3} | aabc daaaat | 2 matches (the aa in aabc and the first aaa in daaaat) |

Let's try one more example. The RegEx [0-9]{2,4} matches at least 2 digits but not more than 4 digits.

| Expression | String | Matched? |
| --- | --- | --- |
| [0-9]{2,4} | ab123csde | 1 match (the 123 in ab123csde) |
| [0-9]{2,4} | 12 and 345673 | 3 matches (12, 3456, 73) |
| [0-9]{2,4} | 1 and 2 | No match |
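
The repetition counts can be verified with re.findall(). This is a quick sketch using strings from the examples in this section:

```python
import re

a_runs = re.findall(r'a{2,3}', 'aabc daaaat')
digit_runs = re.findall(r'[0-9]{2,4}', '12 and 345673')
single_digits = re.findall(r'[0-9]{2,4}', '1 and 2')

print(a_runs)          # ['aa', 'aaa']
print(digit_runs)      # ['12', '3456', '73']
print(single_digits)   # []
```

Matching is greedy, so 345673 is split into 3456 (the maximum of 4 digits) followed by the remaining 73.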

| - Alternation#

The vertical bar | is used for alternation (the or operator).

Matches for the expression a|b:

| String | Matched? |
| --- | --- |
| cde | No match |
| ade | 1 match (the a in ade) |
| acdbea | 3 matches (a, b and a in acdbea) |

Here, a|b matches any string that contains either a or b.

() - Group#

Parentheses () is used to group sub-patterns. For example, (a|b|c)xz match any string that matches either a or b or c. followed by xz.

Expression   String      Matched?
----------   ---------   --------------------
(a|b|c)xz    ab xz       No match
(a|b|c)xz    abxz        1 match (bxz)
(a|b|c)xz    axz cabxz   2 matches (axz, bxz)
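A short sketch of alternation and grouping in Python follows. One detail worth knowing: when the pattern contains a group, re.findall returns the text captured by the group rather than the whole match, so re.finditer is used here to recover the full matches.

```python
import re

# Alternation: every a or b in the string.
alternatives = re.findall(r"a|b", "acdbea")
print(alternatives)

text = "axz cabxz"

# With a capturing group, findall returns only what the group captured.
captured = re.findall(r"(a|b|c)xz", text)
print(captured)

# group(0) of each match object is the complete matched text.
whole_matches = [m.group(0) for m in re.finditer(r"(a|b|c)xz", text)]
print(whole_matches)
```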

\ - Backslash#

Backslash \ is used to escape various characters including all metacharacters. For example,

\$a matches if a string contains $ followed by a. Here, $ is not interpreted by a RegEx engine in a special way.

If you are unsure whether a punctuation character has special meaning or not, you can put \ in front of it. This makes sure the character is not treated in a special way. Do not do this with letters or digits, because a backslash followed by a letter can form a special sequence such as \d.
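The sketch below shows the escaped dollar sign in action. It also uses re.escape, a standard-library helper not mentioned in the text above, which adds the backslashes for you when a literal string has to be used as a pattern.

```python
import re

text = "price: $a, not a$"

# \$ matches a literal dollar sign instead of "end of string".
literal_dollar = re.findall(r"\$a", text)
print(literal_dollar)

# re.escape builds the escaped pattern from a plain string.
escaped_pattern = re.escape("$a")
print(escaped_pattern)
print(re.findall(escaped_pattern, text))
```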

Special Sequences#

Special sequences make commonly used patterns easier to write. Here's a list of special sequences:

\A#

Matches if the specified characters are at the start of a string.

Expression   String       Matched?
----------   ----------   --------
\Athe        the sun      Match
\Athe        In the sun   No match

\b#

Matches if the specified characters are at the beginning or end of a word.

Expression   String          Matched?
----------   -------------   --------
\bfoo        football        Match
\bfoo        a football      Match
\bfoo        afootball       No match
foo\b        the foo         Match
foo\b        the afoo test   Match
foo\b        the afootest    No match

\B#

Opposite of \b. Matches if the specified characters are not at the beginning or end of a word.

Expression   String          Matched?
----------   -------------   --------
\Bfoo        football        No match
\Bfoo        a football      No match
\Bfoo        afootball       Match
foo\B        the foo         No match
foo\B        the afoo test   No match
foo\B        the afootest    Match
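The word-boundary tables can be verified with a few lines of Python. This is a minimal sketch; note the raw strings, since in a normal string literal \b would be read as a backspace character.

```python
import re

starts = ["football", "a football", "afootball"]
ends = ["the foo", "the afoo test", "the afootest"]

# \b requires a word boundary, \B requires that there is none.
b_start = [bool(re.search(r"\bfoo", s)) for s in starts]
B_start = [bool(re.search(r"\Bfoo", s)) for s in starts]
b_end = [bool(re.search(r"foo\b", s)) for s in ends]
B_end = [bool(re.search(r"foo\B", s)) for s in ends]

print(b_start, B_start)
print(b_end, B_end)
```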

\d#

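Matches any decimal digit. Equivalent to [0-9]. Its uppercase counterpart \D matches any character that is not a decimal digit and is equivalent to [^0-9].

  • \d means: match where the string contains digits (numbers from 0-9)

  • \D means: match where the string does not contain digits

Expression   String     Matched?
----------   --------   -------------------
\D           1ab34"50   3 matches (a, b, ")
\D           1345       No match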
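To see the difference between the lowercase and uppercase forms, this sketch applies \d and \D to the same string with re.findall.

```python
import re

text = '1ab34"50'

# \d picks out every digit, \D every character that is not a digit.
digits = re.findall(r"\d", text)
non_digits = re.findall(r"\D", text)

print(digits)
print(non_digits)
```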

\s#

Matches where a string contains any whitespace character. Equivalent to [ \t\n\r\f\v].

Expression   String         Matched?
----------   ------------   --------
\s           Python RegEx   1 match
\s           PythonRegEx    No match

\S#

Matches where a string contains any non-whitespace character. Equivalent to [^ \t\n\r\f\v].

Expression   String              Matched?
----------   -----------------   -------------------
\S           a b                 2 matches (a and b)
\S           (whitespace only)   No match

\w#

Matches any alphanumeric character (digits and alphabets). Equivalent to [a-zA-Z0-9_]. By the way, underscore _ is also considered an alphanumeric character.

Expression   String     Matched?
----------   --------   -------------------
\w           12&": ;c   3 matches (1, 2, c)
\w           %"> !      No match

\W#

Matches any non-alphanumeric character. Equivalent to [^a-zA-Z0-9_].

Expression   String   Matched?
----------   ------   -----------
\W           1a2%c    1 match (%)
\W           Python   No match
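The same kind of check works for \w and \W. The third call is an extra illustration, not taken from the tables, showing that the underscore counts as a word character.

```python
import re

word_chars = re.findall(r"\w", '12&": ;c')
non_word_chars = re.findall(r"\W", "1a2%c")
with_underscore = re.findall(r"\w", "a_b")

print(word_chars)
print(non_word_chars)
print(with_underscore)
```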

\Z#

Matches if the specified characters are at the end of a string.

Expression   String                      Matched?
----------   -------------------------   --------
Python\Z     I like Python               1 match
Python\Z     I like Python Programming   No match
Python\Z     Python is fun.              No match
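As a sketch, the anchors \A and \Z can be tested like this; the sample strings are the ones used in the tables of this section.

```python
import re

# \A anchors the pattern to the start of the string.
starts_with_the = [bool(re.search(r"\Athe", s)) for s in ["the sun", "In the sun"]]

# \Z anchors the pattern to the end of the string.
ends_with_python = [
    bool(re.search(r"Python\Z", s))
    for s in ["I like Python", "I like Python Programming", "Python is fun."]
]

print(starts_with_the)
print(ends_with_python)
```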

Summary - MetaCharacters#

metacharacters
💡 Tip: To build and test regular expressions, you can use RegEx tester tools such as **[regex101](https://regex101.com/)**. This tool not only helps you create regular expressions, it also helps you learn them.

Now that you understand the basics of RegEx, let's discuss how to use it in your Python code. A regular expression (RegEx) is a special text string that describes a search pattern, so it can be used to check whether that pattern occurs in a piece of text. To use RegEx in Python, we first import the built-in module called re.

RegEx Methods#

The re module provides several functions that search a string for a pattern:

  • re.findall: Returns a list containing all matches

  • re.split: Takes a string, splits it at the match points, returns a list

  • re.sub: Replaces one or many matches within a string

  • re.search: Returns a match object for the first match found anywhere in the string (including multiline strings), or None if there is no match.

  • re.match: Matches only at the very beginning of the string and returns a match object if found, else returns None.
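The difference between re.search and re.match is easiest to see on a pattern that occurs in the middle of a string. A minimal sketch, with an illustrative sample sentence:

```python
import re

text = "I love Python"

# re.match only looks at the start of the string, so this finds nothing.
at_start = re.match(r"Python", text)
print(at_start)

# re.search scans the whole string and finds the word further along.
anywhere = re.search(r"Python", text)
print(anywhere.span())
```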

To use it, we need to import the module.

import re

The module defines several functions and constants to work with RegEx.

re.findall()#

The re.findall() method returns a list of strings containing all matches.

# Example 1: re.findall()

# Program to extract numbers from a string

import re

string = 'hello 12 hi 89. Howdy 34'
pattern = r'\d+'  # raw string, so \d reaches the RegEx engine unchanged

result = re.findall(pattern, string) 
print(result)

# Output: ['12', '89', '34']
['12', '89', '34']

If the pattern is not found, re.findall() returns an empty list.

# Example 2: re.findall()

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

# It returns a list
matches = re.findall('language', txt, re.I)
print(matches)  # ['language', 'language']
['language', 'language']

As you can see, the word language was found two times in the string. Let us practice some more. Now we will look for both Python and python words in the string:

# Example 3: re.findall()

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

# It returns list
matches = re.findall('python', txt, re.I)
print(matches)  # ['Python', 'python']
['Python', 'python']

Since we are using re.I both lowercase and uppercase letters are included. If we do not have the re.I flag, then we will have to write our pattern differently. Let us check it out:

# Example 4: re.findall()

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

matches = re.findall('Python|python', txt)
print(matches)  # ['Python', 'python']

matches = re.findall('[Pp]ython', txt)
print(matches)  # ['Python', 'python']
['Python', 'python']
['Python', 'python']

re.split()#

The re.split method splits the string where there is a match and returns a list of strings where the splits have occurred.

# Example 1: re.split()

txt = '''I am teacher and  I love teaching.
There is nothing as rewarding as educating and empowering people.
I found teaching more interesting than any other jobs.
Does this motivate you to be a teacher?'''
print(re.split('\n', txt)) # splitting using \n - end of line symbol
['I am teacher and  I love teaching.', 'There is nothing as rewarding as educating and empowering people.', 'I found teaching more interesting than any other jobs.', 'Does this motivate you to be a teacher?']
# Example 2: re.split()

import re

string = 'Twelve:12 Eighty nine:89.'
pattern = r'\d+'

result = re.split(pattern, string) 
print(result)

# Output: ['Twelve:', ' Eighty nine:', '.']
['Twelve:', ' Eighty nine:', '.']

If the pattern is not found, re.split returns a list containing the original string.

You can pass the maxsplit argument to the re.split method. It's the maximum number of splits that will occur.

# Example 3: re.split()

import re

string = 'Twelve:12 Eighty nine:89 Nine:9.'
pattern = r'\d+'

# maxsplit = 1
# split only at the first occurrence
result = re.split(pattern, string, maxsplit=1)  # pass maxsplit by keyword for clarity
print(result)

# Output: ['Twelve:', ' Eighty nine:89 Nine:9.']
['Twelve:', ' Eighty nine:89 Nine:9.']

By the way, the default value of maxsplit is 0; meaning all possible splits.

re.sub()#

Syntax:

re.sub(pattern, replace, string)

The method returns a string where matched occurrences are replaced with the content of replace variable.

# Example 1: re.sub()

# Program to remove all whitespaces
import re

# multiline string
string = 'abc 12\
de 23 \n f45 6'

# matches all whitespace characters
pattern = r'\s+'

# empty string
replace = ''

new_string = re.sub(pattern, replace, string) 
print(new_string)

# Output: abc12de23f456
abc12de23f456

If the pattern is not found, re.sub() returns the original string.

You can pass count as a fourth parameter to the re.sub() method. If omitted, it defaults to 0, which replaces all occurrences.

# Example 2: re.sub()

import re

# multiline string
string = 'abc 12\
de 23 \n f45 6'

# matches all whitespace characters
pattern = r'\s+'
replace = ''

new_string = re.sub(pattern, replace, string, count=1)  # replace only the first match
print(new_string)

# Output:
# abc12de 23
# f45 6
abc12de 23 
 f45 6
# Example 3: re.sub()

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

match_replaced = re.sub('Python|python', 'JavaScript', txt, flags=re.I)  # pass re.I by keyword; the fourth positional argument is count
print(match_replaced)  # JavaScript is the most beautiful language that a human being has ever created.
# OR
match_replaced = re.sub('[Pp]ython', 'JavaScript', txt, flags=re.I)
print(match_replaced)  # JavaScript is the most beautiful language that a human being has ever created.
JavaScript is the most beautiful language that a human being has ever created.
I recommend JavaScript for a first programming language
JavaScript is the most beautiful language that a human being has ever created.
I recommend JavaScript for a first programming language

Let us add one more example. The following string is really hard to read unless we remove the % symbol. Replacing the % with an empty string will clean the text.

# Example 4: re.sub()

txt = '''%I a%m te%%a%%che%r% a%n%d %% I l%o%ve te%ach%ing. 
T%he%re i%s n%o%th%ing as r%ewarding a%s e%duc%at%i%ng a%n%d e%m%p%ow%er%ing p%e%o%ple.
I fo%und te%a%ching m%ore i%n%t%er%%es%ting t%h%an any other %jobs. 
D%o%es thi%s m%ot%iv%a%te %y%o%u to b%e a t%e%a%cher?'''

matches = re.sub('%', '', txt)
print(matches)
I am teacher and  I love teaching. 
There is nothing as rewarding as educating and empowering people.
I found teaching more interesting than any other jobs. 
Does this motivate you to be a teacher?

re.subn()#

re.subn() is similar to re.sub() except that it returns a tuple of 2 items containing the new string and the number of substitutions made.

# Example 1: re.subn()

# Program to remove all whitespaces
import re

# multiline string
string = 'abc 12\
de 23 \n f45 6'

# matches all whitespace characters
pattern = r'\s+'

# empty string
replace = ''

new_string = re.subn(pattern, replace, string) 
print(new_string)

# Output: ('abc12de23f456', 4)
('abc12de23f456', 4)
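Because re.subn() returns a tuple, the two items can be unpacked directly into separate variables. A small sketch with an illustrative input string:

```python
import re

# Unpack the new string and the substitution count in one step.
new_text, substitutions = re.subn(r"\s+", "", "abc 12 de 23")

print(new_text)
print(substitutions)
```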

Match object#

You can get methods and attributes of a match object using dir() function.

Some of the commonly used methods and attributes of match objects are:

import re

txt = 'I love to teach python and javaScript'
# It returns an object with span, and match
match = re.match('I love to teach', txt, re.I)
print(match)  # <re.Match object; span=(0, 15), match='I love to teach'>
# We can get the starting and ending position of the match as tuple using span
span = match.span()
print(span)     # (0, 15)
# Let's get the start and end positions from the span
start, end = span
print(start, end)  # 0 15
substring = txt[start:end]
print(substring)       # I love to teach
<re.Match object; span=(0, 15), match='I love to teach'>
(0, 15)
0 15
I love to teach

As you can see from the example above, the pattern we are looking for (or the substring we are looking for) is I love to teach. The match function returns an object only if the text starts with the pattern.

import re

txt = 'I love to teach python and javaScript'
match = re.match('I like to teach', txt, re.I)
print(match)  # None
None

The string does not start with I like to teach, therefore there was no match and re.match returned None.

match.group()#

The group() method returns the part of the string where there is a match.

# Example 6: Match object

import re

string = '39801 356, 2102 1111'

# Three digit number followed by space followed by two digit number
pattern = r'(\d{3}) (\d{2})'

# match variable contains a Match object.
match = re.search(pattern, string) 

if match:
    print(match.group())
else:
    print("pattern not found")

# Output: 801 35
801 35

Here, match variable contains a match object.

Our pattern (\d{3}) (\d{2}) has two subgroups (\d{3}) and (\d{2}). You can get the part of the string of these parenthesized subgroups. Here's how:

match.group(1)
'801'
match.group(2)
'35'
match.group(1, 2)
('801', '35')
match.groups()
('801', '35')

match.start(), match.end() and match.span()#

The start() function returns the index of the start of the matched substring. Similarly, end() returns the end index of the matched substring.

match.start()
2
match.end()
8

The span() function returns a tuple containing start and end index of the matched part.

match.span()
(2, 8)
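The indices returned by span() can be used to slice the original string, which gives the same text as group(). The sketch below repeats the search from Example 6 so that it runs on its own. It also passes a group number to span(), which the match object accepts for individual subgroups.

```python
import re

string = '39801 356, 2102 1111'
match = re.search(r'(\d{3}) (\d{2})', string)

# Slicing with the span reproduces the matched text.
start, end = match.span()
sliced = string[start:end]
print(sliced)

# Each subgroup has its own span as well.
group_spans = (match.span(1), match.span(2))
print(group_spans)
```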

match.re and match.string#

The re attribute of a matched object returns a regular expression object. Similarly, string attribute returns the passed string.

match.re
re.compile('(\\d{3}) (\\d{2})')
match.string
'39801 356, 2102 1111'

We have covered all commonly used methods defined in the re module.

If you want to learn more, visit Python 3 re module.

Using r prefix before RegEx#

When the r or R prefix is placed before a string literal, Python treats it as a raw string. For example, '\n' is a single newline character, whereas r'\n' is two characters: a backslash \ followed by n.

Backslash \ is used to escape various characters including all metacharacters. In a raw string, however, Python leaves the backslash untouched, so it reaches the RegEx engine exactly as written.

# Example 7: Raw string using r prefix

import re

string = '\n and \r are escape sequences.'

result = re.findall(r'[\n\r]', string) 
print(result)

# Output: ['\n', '\r']
['\n', '\r']
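The raw prefix matters most for sequences that mean something to both Python and the RegEx engine. In this sketch, '\b' in a normal string is a backspace character, while r'\b' is the word-boundary sequence.

```python
import re

# A normal string turns \n into one character, a raw string keeps two.
lengths = (len('\n'), len(r'\n'))
print(lengths)

# Without r, the pattern looks for a backspace followed by "foo".
without_raw = re.search('\bfoo', 'a foo')
print(without_raw)

# With r, the pattern looks for "foo" at a word boundary.
with_raw = re.search(r'\bfoo', 'a foo')
print(with_raw.group())
```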

Example of RegEx with Metacharacters#

Let us use examples to clarify the meta characters with RegEx methods:

Square Brackets []#

Let us use square bracket to include lower and upper case

# Example 1:

regex_pattern = r'[Aa]pple' # this square bracket mean either A or a
txt = 'Apple and banana are fruits. An old cliche says an apple a day a doctor way has been replaced by a banana a day keeps the doctor far far away.'
matches = re.findall(regex_pattern, txt)
print(matches)  # ['Apple', 'apple']
['Apple', 'apple']

If we want to look for both apple and banana, we write the pattern as follows:

# Example 2:

regex_pattern = r'[Aa]pple|[Bb]anana' # [Aa] means either A or a, [Bb] means either B or b, | means or
txt = 'Apple and banana are fruits. An old cliche says an apple a day a doctor way has been replaced by a banana a day keeps the doctor far far away.'
matches = re.findall(regex_pattern, txt)
print(matches)  # ['Apple', 'banana', 'apple', 'banana']
['Apple', 'banana', 'apple', 'banana']

Using the square brackets [] and the or operator |, we managed to extract Apple, apple and both occurrences of banana.

Escape character \#

# Example 1:

regex_pattern = r'\d'  # \d is a special sequence that matches a single digit
txt = "Hawking born on 8 January 1942 and died on 14 March 2018 Einstein's birth anniversary (Pi-Day) and both died at 76"
matches = re.findall(regex_pattern, txt)
print(matches)  # ['8', '1', '9', '4', '2', '1', '4', '2', '0', '1', '8', '7', '6'] - this is not what we want
['8', '1', '9', '4', '2', '1', '4', '2', '0', '1', '8', '7', '6']

One or more times +#

# Example 1:

regex_pattern = r'\d+'  # \d matches a digit, + means one or more times
txt = "Hawking born on 8 January 1942 and died on 14 March 2018 Einstein's birth anniversary (Pi-Day) and both died at 76"
matches = re.findall(regex_pattern, txt)
print(matches)  # ['8', '1942', '14', '2018', '76'] - this is better!
['8', '1942', '14', '2018', '76']

Period .#

# Example 1:

regex_pattern = r'[a].'  # this square bracket means a and . means any character except new line
txt = '''Apple and Banana are fruits'''
matches = re.findall(regex_pattern, txt)
print(matches)  # ['an', 'an', 'an', 'a ', 'ar']
['an', 'an', 'an', 'a ', 'ar']
# Example 2: [] with +

regex_pattern = r'[a].+'  # . any character, + any character one or more times 
matches = re.findall(regex_pattern, txt)
print(matches)  # ['and Banana are fruits']
['and Banana are fruits']

Zero or more times *#

# Example 1:

regex_pattern = r'[a].*'  # . any character, * any character zero or more times 
txt = '''Apple and Banana are fruits'''
matches = re.findall(regex_pattern, txt)
print(matches)  # ['and Banana are fruits']
['and Banana are fruits']

Zero or one time ?#

Zero or one time. The pattern may not occur or it may occur once.

# Example 1:

txt = '''I am not sure if there is a convention how to write the word e-mail.
Some people write it as email others may write it as Email or E-mail.'''
regex_pattern = r'[Ee]-?mail'  # ? means here that '-' is optional
matches = re.findall(regex_pattern, txt)
print(matches)  # ['e-mail', 'email', 'Email', 'E-mail']
['e-mail', 'email', 'Email', 'E-mail']

Quantifier {}#

We can specify how many times a pattern must repeat using curly braces {}. Suppose we are interested in sequences of exactly four digits:

# Example 1:

txt = "Hawking born on 8 January 1942 and died on 14 March 2018 Einstein's birth anniversary (Pi-Day) and both died at 76"
regex_pattern = r'\d{4}'  # exactly four times
matches = re.findall(regex_pattern, txt)
print(matches)  # ['1942', '2018']
['1942', '2018']
# Example 2:

txt = "Hawking born on 8 January 1942 and died on 14 March 2018 Einstein's birth anniversary (Pi-Day) and both died at 76"
regex_pattern = r'\d{1,4}'   # 1 to 4
matches = re.findall(regex_pattern, txt)
print(matches)  # ['8', '1942', '14', '2018', '76']
['8', '1942', '14', '2018', '76']

Caret ^#

# Example 1: Starts with

txt = "Hawking born on 8 January 1942 and died on 14 March 2018 Einstein's birth anniversary (Pi-Day) and both died at 76"
regex_pattern = r'^Hawking'  # ^ means starts with
matches = re.findall(regex_pattern, txt)
print(matches)  # ['Hawking']
['Hawking']
# Example 2: Negation

txt = "Hawking born on 8 January 1942 and died on 14 March 2018 Einstein's birth anniversary (Pi-Day) and both died at 76"
regex_pattern = r'[^A-Za-z ]+'  # ^ in set character means negation, not A to Z, not a to z, no space
matches = re.findall(regex_pattern, txt)
print(matches)  # ['8', '1942', '14', '2018', "'", '(', '-', ')', '76']
['8', '1942', '14', '2018', "'", '(', '-', ')', '76']

*args and **kwargs#


In programming, we define a function to package reusable code that performs a particular operation. To run that operation, we call the function with specific values; each such value is called a function argument in Python.

Suppose, we define a function for addition of 3 numbers.

# Example 1: Function to add 3 numbers

def adder(x,y,z):
    print("sum:",x+y+z)

adder(12,15,19)
sum: 46

Let's see what happens when we pass more than 3 arguments to the adder() function.

def adder(x,y,z):
    print("sum:",x+y+z)

adder(5,10,15,20,25)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-986e6d871071> in <module>
      2     print("sum:",x+y+z)
      3 
----> 4 adder(5,10,15,20,25)

TypeError: adder() takes 3 positional arguments but 5 were given

In the above program, we passed 5 arguments to the adder() function, which accepts only 3, so Python raised a TypeError.

In Python, we can pass a variable number of arguments to a function using special symbols. There are two special symbols:

  • *args (Non Keyword Arguments)

  • **kwargs (Keyword Arguments)

We use *args and **kwargs as parameters when we do not know in advance how many arguments will be passed to the function.

*args#

As the example above shows, we do not always know how many arguments will be passed to a function. Python's *args lets a function accept a variable number of non-keyword (positional) arguments.

👉 In the function definition, we put an asterisk * before the parameter name to accept variable-length arguments.

👉 The arguments are collected into a tuple, which is available inside the function under the parameter name without the asterisk *.

# Example 2: Using *args to pass the variable length arguments to the function

def adder(*num):
    total = 0  # named total to avoid shadowing the built-in sum()
    
    for n in num:
        total = total + n

    print("Sum:",total)

adder(3,6)
adder(3,5,6,7)
adder(1,2,3,6,9)
Sum: 9
Sum: 21
Sum: 21

In the above program, we used *num as a parameter which allows us to pass variable length argument list to the adder() function.
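The sketch below is a variation on the adder() function, written for illustration only. It returns its result instead of printing it, reports the type of the collected arguments, and shows that an existing list can be spread into separate arguments with * at the call site.

```python
def adder(*num):
    # num is a tuple holding every positional argument that was passed.
    return type(num).__name__, sum(num)

two_numbers = adder(3, 6)
print(two_numbers)

# The asterisk in the call unpacks the list into individual arguments.
values = [1, 2, 3, 6, 9]
from_list = adder(*values)
print(from_list)

# Calling with no arguments is also allowed; the tuple is simply empty.
no_numbers = adder()
print(no_numbers)
```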

**kwargs#

👉 *args collects a variable number of non-keyword arguments, but it cannot be used to pass keyword arguments.

👉 For that, Python provides **kwargs, which lets us pass a variable number of keyword arguments to the function.

👉 In the function definition, we put a double asterisk ** before the parameter name to denote this type of argument. Inside the function, the arguments are available as a dictionary.

# Example 3: Using **kwargs to pass the variable keyword arguments to the function 

def intro(**data):
    print("\nData type of argument:",type(data))

    for key, value in data.items():
        print("{} is {}".format(key,value))

intro(Firstname="Amy", Lastname="Barn", Age=24, Phone=1234567890)
intro(Firstname="Arthur", Lastname="Hunt", Email="arthurhunt@yesmail.com", Country="Atlantis", Age=27, Phone=9976563219)
Data type of argument: <class 'dict'>
Firstname is Amy
Lastname is Barn
Age is 24
Phone is 1234567890

Data type of argument: <class 'dict'>
Firstname is Arthur
Lastname is Hunt
Email is arthurhunt@yesmail.com
Country is Atlantis
Age is 27
Phone is 9976563219
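Both forms can appear in the same function, as long as ordinary parameters come first, then *args, then **kwargs. The describe() function below is a hypothetical helper written only to illustrate this ordering; it also shows that a dictionary can be spread into keyword arguments with ** at the call site.

```python
def describe(title, *args, **kwargs):
    # title is a regular parameter, args is a tuple, kwargs is a dict.
    parts = [title]
    parts.extend(str(a) for a in args)
    parts.extend(f"{key}={value}" for key, value in kwargs.items())
    return ", ".join(parts)

summary = describe("Profile", 1, 2, Firstname="Amy", Age=24)
print(summary)

# The double asterisk in the call unpacks the dictionary into keyword arguments.
person = {"Firstname": "Arthur", "Country": "Atlantis"}
from_dict = describe("Profile", **person)
print(from_dict)
```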