{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Dans ce notebook nous verrons comment utiliser pandas et sqlite3 ensemble \n", "ainsi que l'impct sur les performances des GroupBy des index dans SQLite3." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import sqlite3\n", "import pandas " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On crée une base de données en mémoire:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "!rm test.db\n", "db = sqlite3.connect(\"test.db\") " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On va créer un générateur pour peupler la table de donner artificielle. \n", "On utilise simplement des entiers et des congruences modulo pour avoir des données variée mais pas trop." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def generateur(size):\n", " for i in range(size):\n", " yield (i%1001, i%1003, i%1007)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On peut faire varier la taille du dataset en changeant l'argument du générateur. \n", "Ici on fait en sorte d'avoir une base de de donnée qui ne tient pas en mémoire, \n", "pour l'exercice. La génération de la table prend un certain temps" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1min 47s, sys: 1.11 s, total: 1min 48s\n", "Wall time: 1min 54s\n" ] } ], "source": [ "%%time \n", "db.execute(\"CREATE TABLE A(x, y, z)\") # On crée une table avec 3 colonnes\n", "db.executemany(\"INSERT INTO A values (?, ?, ?)\", generateur(100000000))\n", "db.commit() # La base de donnée fait ~ 2go. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pandas et SQL intégration\n", "Il est possible de faire communiquer Pandas et SQL facilement.\n", "Par exemple, pour rappatrier une table SQLite dans une data frame. \n", "Par contre, attention si la table ne tient pas en mémoire." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "df = pandas.read_sql(\"SELECT * FROM A LIMIT 1000\", db) # Charge en mémoire la table (ouch si ça tient pas !) " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
xyz
0000
1111
2222
3333
4444
............
995995995995
996996996996
997997997997
998998998998
999999999999
\n", "

1000 rows × 3 columns

\n", "
" ], "text/plain": [ " x y z\n", "0 0 0 0\n", "1 1 1 1\n", "2 2 2 2\n", "3 3 3 3\n", "4 4 4 4\n", ".. ... ... ...\n", "995 995 995 995\n", "996 996 996 996\n", "997 997 997 997\n", "998 998 998 998\n", "999 999 999 999\n", "\n", "[1000 rows x 3 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1min 9s, sys: 1.6 s, total: 1min 10s\n", "Wall time: 1min 11s\n" ] } ], "source": [ "query = \"SELECT x,y,count(*) as Count FROM A GROUP BY x,y\"\n", "%time df = pandas.read_sql(query, db)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
x0123456789...9919929939949959969979989991000
y
0100100100100100100100100100100...9910099100991009910099100
1100100100100100100100100100100...1009910099100991009910099
299100100100100100100100100100...9910099100991009910099100
310099100100100100100100100100...1009910099100991009910099
49910099100100100100100100100...9910099100991009910099100
..................................................................
998100100100100100100100100100100...100991009910099100100100100
999100100100100100100100100100100...99100991009910099100100100
1000100100100100100100100100100100...10099100991009910099100100
1001100100100100100100100100100100...9910099100991009910099100
1002100100100100100100100100100100...1009910099100991009910099
\n", "

1003 rows × 1001 columns

\n", "
" ], "text/plain": [ "x 0 1 2 3 4 5 6 7 8 9 ... 991 \\\n", "y ... \n", "0 100 100 100 100 100 100 100 100 100 100 ... 99 \n", "1 100 100 100 100 100 100 100 100 100 100 ... 100 \n", "2 99 100 100 100 100 100 100 100 100 100 ... 99 \n", "3 100 99 100 100 100 100 100 100 100 100 ... 100 \n", "4 99 100 99 100 100 100 100 100 100 100 ... 99 \n", "... ... ... ... ... ... ... ... ... ... ... ... ... \n", "998 100 100 100 100 100 100 100 100 100 100 ... 100 \n", "999 100 100 100 100 100 100 100 100 100 100 ... 99 \n", "1000 100 100 100 100 100 100 100 100 100 100 ... 100 \n", "1001 100 100 100 100 100 100 100 100 100 100 ... 99 \n", "1002 100 100 100 100 100 100 100 100 100 100 ... 100 \n", "\n", "x 992 993 994 995 996 997 998 999 1000 \n", "y \n", "0 100 99 100 99 100 99 100 99 100 \n", "1 99 100 99 100 99 100 99 100 99 \n", "2 100 99 100 99 100 99 100 99 100 \n", "3 99 100 99 100 99 100 99 100 99 \n", "4 100 99 100 99 100 99 100 99 100 \n", "... ... ... ... ... ... ... ... ... ... \n", "998 99 100 99 100 99 100 100 100 100 \n", "999 100 99 100 99 100 99 100 100 100 \n", "1000 99 100 99 100 99 100 99 100 100 \n", "1001 100 99 100 99 100 99 100 99 100 \n", "1002 99 100 99 100 99 100 99 100 99 \n", "\n", "[1003 rows x 1001 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pivot_table(\"Count\", columns=[\"x\"], index=[\"y\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Query Plan\n", "Il est possible de vérifier le plan d'execution de la requête `SQL`:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(6, 0, 0, 'SCAN TABLE A')\n", "(8, 0, 0, 'USE TEMP B-TREE FOR GROUP BY')\n" ] } ], "source": [ "print(\"\\n\".join(list(map(lambda e:str(e), db.execute('EXPLAIN QUERY PLAN '+query)))))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "SCAN TABLE indique que SQL passe au travers de la table séquentiellement. \n", "Ici, ce n'est pas un très bon plan de requêtes, à l'aide d'un index, les performances \n", "pourraient être meilleurs." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "db.execute(\"CREATE INDEX groupby ON A (x,y)\") " ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(6, 0, 0, 'SCAN TABLE A USING COVERING INDEX groupby')\n" ] } ], "source": [ "print(\"\\n\".join(list(map(lambda e:str(e), db.execute('EXPLAIN QUERY PLAN '+query)))))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On voit qu'ici le plan de requête utilise l'index pour scanner la table. Ça va pas mal accélérer la requête. " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 7.42 s, sys: 360 ms, total: 7.78 s\n", "Wall time: 7.78 s\n" ] } ], "source": [ "%time df = pandas.read_sql(query, db) # On gagne pas mal de temps ! Six fois plus rapide avec l'index." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }