The goal of this lab is to discover the algorithms behind JOINs and to analyze the complexity.

We are going to play with synthetic data that we are going to generate.

import string, random 

def rand_str(k: int):
    """ generate a random string """
    return ''.join(random.choices(string.ascii_letters, k=k))

print("This is a random string of size 10", rand_str(10))

This is a random string of size 10 xXVilRnhJG

Complete the code of the following function

def random_relation(size: int, id_size: int, value_size: int):
    """ 
        Generate a relation of arity 2 with `size` rows of string
        with the first component contains string of length `id_size`
        and the second component, string of length `value_size`

        Should returns an iterator of couple of str.

        Example
        -------
        >>> list(random_relation(2, 2, 4))
        [('yW', 'lOep'), ('Gh', 'ZzWn')]
    """
    ...

Write down a function cartesian_product(R1, R2) that perform the cartesian product of two such relations.

For instance with:

R1 = [('gL', 'eZyH'), ('Db', 'pSDa')]
R2 = [('ai', 'GTxZ'), ('QX', 'iREd'), ('Wy', 'emon')]

the functon cartesian_product(R1, R2) should returns:

[('gL', 'eZyH', 'ai', 'GTxZ'),
 ('gL', 'eZyH', 'QX', 'iREd'),
 ('gL', 'eZyH', 'Wy', 'emon'),
 ('Db', 'pSDa', 'ai', 'GTxZ'),
 ('Db', 'pSDa', 'QX', 'iREd'),
 ('Db', 'pSDa', 'Wy', 'emon')]

Write down a naïve algorithm that perform the join of two such relations on equality of the first columns.
Plot the time taken by performing naive joins between two random relations with the following constraints:

One plot with the first relations being of size \(n\) and the second of size \(n^2\)
One plot with the two relations beings of the same size.

Takes some time to check the appropriate value for \(n\).

Perform a complexity analyzis of the problem

Sort JOIN

When two relations are sorted by their join conditions, it is possible to improve a lot the performances. Write down a SortedJoin function that:

Sort the relations
Perform a \(O(n+k)\) algorithm on the sorted relations with \(n\) and \(k\) the size of the relations.

What is the complexity of the resulting implementations?
Compare the performance with the previous implementations

Indexed Join

An index allows to retrive efficiently a row when given the id (here the first component). It is (excessively) easy to do that in Python with a dict.

Build a function IndexedJoin that compute the index for the smaller relations, then go through the largest one with the index to compute the join.

What is the complexity of the resulting implementations?
Compare the performance with the previous implementations

Correction

Too big

The following function can dump into a file a relation of arbitrary size.

def dump_relation(file_name, R):
    """ 
    Dump the relation into a tabulated separated file.
    R can be a relation or an iterator """

    with open(file_name, "w") as f:
        for a,b in R:
            f.write(f"{a}\t{b}\n")

def load_relation(file_name):
    with open(file_name) as f:
        yield from map(str.split, filter(bool, f))

We are going to hack a bit the random_relation function to amplify its size to makes those exercice a bit harder.

def inflate_relation(R, factor):
    yield from ((a, b*factor) for a, b in R)

Then you can generate two huge relations written on disk as follows dump_relation("/path/to/your/file", inflate_relation(random_relation(10**6, 3, 50), 100))

Be carefull if you have limited space on your computer as the file will be of 5Go per file. Adjust the parameter if this is too big for your computer and don’t forget to delete them after!

If you have adjusted well the parameter, the list(load_relation("path/to/your/file") should make a memory error and kill python. You can’t hold all the data in memory and need to perform your JOIN more smartely.

Try to perform a JOIN between two two big relations

Introduction to SQLite

SQLite is a proto database very usefull to execute SQL within any programs. The database itself is a single file that can be shared easily. In this lecture, we will use the lastest version: sqlite3 before moving on to PostgreSQL. You can use sqlite with:

Python/Jupyter Notebook and the
The command line (sqlite3 command on linux and other utilities on Windows)
Graphical interfaces like: https://sqlitebrowser.org/dl/

Python usage

Within a Python script you can use:

import sqlite3
# db = sqlite3.connect("path/to/some/file.db")
db = sqlite3.connect(":memory:") # create the database in memory only
c = db.cursor()
c.execute("SELECT 1,'a' UNION SELECT 2, 'b')
for x,y in c:
    print(x,y)

Command line usage

sqlite3 path/to/some/file.db  'SELECT 1,"a" UNION SELECT 2, "a"'

Error: unable to open database "path/to/some/file.db": unable to open database file

Graphical interfaces

I HAVE NO CLUE.

Try to perform the JOIN using SQLite

You can use executemany instead of execute to accelerate populating a table