Database I, année 2023
The goal of this lab is to discover the algorithms behind JOINs and to analyze the complexity.
We are going to play with synthetic data that we are going to generate.
import string, random
def rand_str(k: int):
""" generate a random string """
return ''.join(random.choices(string.ascii_letters, k=k))
print("This is a random string of size 10", rand_str(10))
This is a random string of size 10 kpGtkWJaiz
- Complete the code of the following function
def random_relation(size: int, id_size: int, value_size: int):
"""
Generate a relation of arity 2 with `size` rows of string
with the first component contains string of length `id_size`
and the second component, string of length `value_size`
Should returns an iterator of couple of str.
Example
-------
>>> list(random_relation(2, 2, 4))
[('yW', 'lOep'), ('Gh', 'ZzWn')]
"""
...
- Write down a function
cartesian_product(R1, R2)
that perform the cartesian product of two such relations.
For instance with:
= [('gL', 'eZyH'), ('Db', 'pSDa')]
R1 = [('ai', 'GTxZ'), ('QX', 'iREd'), ('Wy', 'emon')] R2
the functon cartesian_product(R1, R2)
should
returns:
'gL', 'eZyH', 'ai', 'GTxZ'),
[('gL', 'eZyH', 'QX', 'iREd'),
('gL', 'eZyH', 'Wy', 'emon'),
('Db', 'pSDa', 'ai', 'GTxZ'),
('Db', 'pSDa', 'QX', 'iREd'),
('Db', 'pSDa', 'Wy', 'emon')] (
Write down a naïve algorithm that perform the join of two such relations on equality of the first columns.
Plot the time taken by performing naive joins between two random relations with the following constraints:
- One plot with the first relations being of size \(n\) and the second of size \(n^2\)
- One plot with the two relations beings of the same size.
Takes some time to check the appropriate value for \(n\).
- Perform a complexity analyzis of the problem
Sort JOIN
When two relations are sorted by their join conditions, it is
possible to improve a lot the performances. Write down a
SortedJoin
function that:
- Sort the relations
- Perform a \(O(n+k)\) algorithm on the sorted relations with \(n\) and \(k\) the size of the relations.
- What is the complexity of the resulting implementations?
- Compare the performance with the previous implementations
Indexed Join
An index allows to retrive efficiently a row when given the
id
(here the first component). It is (excessively) easy to
do that in Python with a dict
.
Build a function IndexedJoin
that compute the index for
the smaller relations, then go through the largest one with the index to
compute the join.
- What is the complexity of the resulting implementations?
- Compare the performance with the previous implementations
Too big
The following function can dump into a file a relation of arbitrary size.
def dump_relation(file_name, R):
"""
Dump the relation into a tabulated separated file.
R can be a relation or an iterator """
with open(file_name, "w") as f:
for a,b in R:
f"{a}\t{b}\n")
f.write(
def load_relation(file_name):
with open(file_name) as f:
yield from map(str.split, filter(bool, f))
We are going to hack a bit the random_relation
function to amplify its size to makes those exercice a bit harder.
def inflate_relation(R, factor):
yield from ((a, b*factor) for a, b in R)
Then you can generate two huge relations written on disk as follows
dump_relation("/path/to/your/file", inflate_relation(random_relation(10**6, 3, 50), 100))
Be carefull if you have limited space on your computer as the file will be of 5Go per file. Adjust the parameter if this is too big for your computer and don’t forget to delete them after!
If you have adjusted well the parameter, the
list(load_relation("path/to/your/file")
should make a
memory error and kill python. You can’t hold all the data in memory and
need to perform your JOIN more smartely.
Try to perform a JOIN between two two big relations
Introduction to SQLite
SQLite
is a proto database very usefull to execute SQL
within any programs. The database itself is a single file that can be
shared easily. In this lecture, we will use the lastest version:
sqlite3
before moving on to PostgreSQL
. You
can use sqlite
with:
- Python/Jupyter Notebook and the
- The command line (
sqlite3
command on linux and other utilities on Windows) - Graphical interfaces like: https://sqlitebrowser.org/dl/
Python usage
Within a Python script you can use:
import sqlite3
# db = sqlite3.connect("path/to/some/file.db")
db = sqlite3.connect(":memory:") # create the database in memory only
c = db.cursor()
c.execute("SELECT 1,'a' UNION SELECT 2, 'b')
for x,y in c:
print(x,y)
Command line usage
sqlite3 path/to/some/file.db 'SELECT 1,"a" UNION SELECT 2, "a"'
Error: unable to open database "path/to/some/file.db": unable to open database file
Graphical interfaces
I HAVE NO CLUE.
Try to perform the JOIN using SQLite
You can use executemany
instead of execute to accelerate
populating a table
Compiled the: lun. 07 oct. 2024 15:20:54 CEST