Database I, 2023
Practical Database Homework: extracting information from overly large datasets
ATTENTION: the datasets are somewhat broken right now; I am dealing with it. You can find extractions of a database at the following links:
- full (20G)
- medium (495M)
- short (49M)
- very_short (51K)
Those files contain research paper metadata. The full one actually contains every academic paper published up to last month.
For each of the remaining questions, you should prototype your solution with the short version and try to climb to the medium one. The full version is mostly here as a challenge: handling it is very hard (but not impossible!). Expect a factor of 3 to 5 between the compressed and uncompressed versions of a file.
If you don’t have the disk space, consider using a USB key. It will slow you down, but it should work nevertheless. You don’t really need to unzip those files: Python can natively read gzip files through the gzip module. I advise you to keep them compressed.
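To illustrate reading a compressed file directly, here is a minimal sketch; the helper name `peek_rows` is mine, and it assumes the extractions are gzip-compressed CSV files (check the actual file you downloaded):

```python
import csv
import gzip

def peek_rows(path, n=5):
    """Yield the first n CSV rows of a gzip file, without unzipping it on disk."""
    # mode="rt" decompresses on the fly and decodes bytes to text.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as f:
        for i, row in enumerate(csv.reader(f)):
            if i >= n:
                break
            yield row
```

Because the file is streamed row by row, this works even on the full extraction without loading it into RAM.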
Grading
The minimal work to pass the homework is to finish the SQL part with the short file. To get a good grade you can:
- Increase the dataset size
- Do the advanced questions
I don’t expect anyone to succeed at the advanced part with the full dataset. But please, prove me wrong!
I expect you to look up the information you don’t know on the web. You can and should ask me questions by mail if you are stuck.
On these topics it is really easy to stay stuck for a long time for a silly reason. Please do not do this homework at the last minute, and ask me questions by mail if you are stuck for too long.
0. Explore the dataset
Download the very_short dataset and explore it to understand the structure of the file.
The list of authors is stored in a CSV cell in JSON format. Read about JSON if you have never heard of it; in particular, read the documentation of the Python json module.
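As a sketch of what decoding such a cell looks like: the exact JSON shape below (a list of objects with "first" and "last" keys) is an assumption for illustration; inspect the very_short file to see the real keys.

```python
import json

def parse_authors(cell):
    """Decode a CSV cell that stores the author list as a JSON string.

    The key names used in the example below are hypothetical; check the
    actual dataset for the real structure.
    """
    return json.loads(cell)

# Hypothetical cell content, for illustration only:
cell = '[{"first": "Ada", "last": "Lovelace"}, {"first": "Alan", "last": "Turing"}]'
authors = parse_authors(cell)
```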
Pythonic querying
Taking inspiration from the first lab, write the following functions:
def extract_author_paper(first_name, family_name):
    """
    Return an iterator over the rows of a given author.
    """

def get_all_authors():
    """
    Return an iterator of an arity-two relation containing each author
    (first name, family name) of the dataset, without repetition: each
    author should appear only once.

    Remark
    ------
    For the full version, it might not fit in RAM.
    """
SQL querying
The goal of this question is to build a sqlite database with those data.
Propose a SQL schema with three tables:
- One for authors (identifier, first name, family name), where each author is stored at most once
- One for papers (identifier, title, and document type)
- A third table giving the relationship between authors and papers
Ingest your data in your database
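A possible shape for such a schema and its ingestion, as a sketch: all table and column names below are my own choices, not imposed by the homework.

```python
import sqlite3

# Hypothetical table/column names; adapt them to your own schema.
SCHEMA = """
CREATE TABLE IF NOT EXISTS author (
    id INTEGER PRIMARY KEY,
    first_name TEXT,
    family_name TEXT,
    UNIQUE (first_name, family_name)     -- each author stored at most once
);
CREATE TABLE IF NOT EXISTS paper (
    id INTEGER PRIMARY KEY,
    title TEXT,
    doc_type TEXT
);
CREATE TABLE IF NOT EXISTS wrote (
    author_id INTEGER REFERENCES author(id),
    paper_id INTEGER REFERENCES paper(id),
    PRIMARY KEY (author_id, paper_id)
);
"""

def create_schema(con):
    """Create the three tables (a no-op if they already exist)."""
    con.executescript(SCHEMA)
    return con

def add_paper(con, paper_id, title, doc_type, authors):
    """Insert one paper and link it to its (first, last) author pairs."""
    con.execute("INSERT OR IGNORE INTO paper VALUES (?, ?, ?)",
                (paper_id, title, doc_type))
    for first, last in authors:
        # INSERT OR IGNORE relies on the UNIQUE constraint for deduplication.
        con.execute(
            "INSERT OR IGNORE INTO author (first_name, family_name) VALUES (?, ?)",
            (first, last))
        (author_id,) = con.execute(
            "SELECT id FROM author WHERE first_name = ? AND family_name = ?",
            (first, last)).fetchone()
        con.execute("INSERT OR IGNORE INTO wrote VALUES (?, ?)",
                    (author_id, paper_id))
```

For large files, wrap the ingestion in a single transaction (or batch with `executemany`) to keep it fast.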
Implement the previous Python queries with SQL.
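For instance, the author-lookup query becomes a three-way join; this sketch assumes tables named author, paper, and wrote with the obvious columns (those names are my assumption, adapt them to your schema):

```python
import sqlite3

QUERY = """
SELECT paper.*
FROM paper
JOIN wrote  ON wrote.paper_id = paper.id
JOIN author ON author.id      = wrote.author_id
WHERE author.first_name = ? AND author.family_name = ?
"""

def extract_author_paper_sql(con, first_name, family_name):
    """Iterate over the papers of a given author, via SQL this time."""
    # sqlite3 cursors are already iterators over the result rows.
    return con.execute(QUERY, (first_name, family_name))
```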
(Advanced) Searching
With a mix of Python and SQL, produce a relation of arity two where:
- the first field contains a word
- the second field contains the identifier of a research paper
such that the word occurs in the title of the research paper. You should remove stop words. Consider using the nltk package for that, or simply find a list of stop words somewhere.
For instance:
Isolation and characterization of <i>Dehalococcoides</i> sp. strain FL2, a trichloroethene (TCE)‐ and 1,2‐dichloroethene‐respiring anaerobe
should be associated with the words isolation, characterization, dehalococcoides, etc.
- Build a SQL query that finds all articles containing a given list of words.
- Implement a search engine for your database in Python.
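The tokenization step can be sketched as follows; the tiny STOP_WORDS set here is a hand-made placeholder for illustration (nltk’s stopwords corpus is far more complete), and the tag-stripping regex is a simple assumption about how markup like <i> appears in titles:

```python
import re

# Placeholder stop word list; replace with nltk.corpus.stopwords or a full list.
STOP_WORDS = {"a", "an", "and", "of", "the", "sp"}

def title_words(title):
    """Return the lowercased words of a title, stop words removed."""
    text = re.sub(r"<[^>]+>", " ", title)      # drop HTML-like tags such as <i>
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]
```

Each (word, paper identifier) pair produced this way can then be inserted into a fourth table, which the search query joins against.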
Compiled: Wed 27 Nov 2024, 16:44:19 CET