Practical Database Homework: extracting information from too larges data-set

ATTENTION the dataset are kindof broken right now, I am dealing with it. You can find on the following links some extractions of a database:

full 20G
medium 495M
short 49M
very_short 51K

Those file contains research paper metadata. The full one contains actually all academic paper even published up to last month.

On the remaining, for each questions, you should prototype solution with short version and try to climb to the medium. The full version is here mostly as a challenge. You should try to handle it is very hard (but not undoable!). Expect a factor 3 to 5 between compress and uncompress version of the file.

If you don’t have the disk space, consider using a USB-key. It will slow you down but it should work nevertheless. You don’t really need to unzip those file. Python can read natively gzip file through the gzip module. I advise you to keep them compressed.

Grading

The minimal work to pass the homework is to finish the SQL part with the short file. To get a good grades you can:

Increase the dataset size
Do the advance question

I don’t expect anyone to succeed at doing the advance part with the full dataset. But please, prove me wrong!

I expect you to fetch the information you don’t know on the web. You can and should ask me question per mail if you are stuck.

It is really easy on those topics to get stuck for stupid reason for a lot of time. Please do not do this homework last minute and ask me question by mail if you are stuck for too long.

0. Explore the dataset

Download the very_short dataset and explore it to understand the structure of the file;

The list of authors is stored in a CSV cell in JSON format. Read json documentation to know more if you never heard about it. In particular, read the Python Documentation

Pythonic querying

By taking inspiration of the first lab, write the following function:

def extract_author_paper(first_name, family_name):
    """
        Return an iterator over the rows of a given author. 
    """

def get_all_authors():
    """
        Return an iterator of an arity two relation containing each authors (first name, family name)
        of the dataset without repetition: each authors should appear only once.

        Remark
        ------
            For the full version, it might not fit in RAM.
    """

SQL querying

The goal of this question is to build a sqlite-database with those data.

Propose a SQL-schema with three tables:
- One for authors (identifier, first name and family name) with each authors is stored at most once
- One for each papers (identifier, titles and document type)
- A third table giving the relationship between authors and papers
Ingest your data in your database
Implement the previous python-query with SQL

(Advance) Searching

With a mixte of Python and SQL produces a relation of arity two where:

the first field contains a word
the second field contains an identifier of a research paper

such that the word occurs in the title of the research paper. You should remove stop words. Consider using the nltk package for that or simply find somewhere a list of stop words.

For instance:

Isolation and characterization of <i>Dehalococcoides</i> sp. strain FL2, a trichloroethene (TCE)‐ and 1,2‐dichloroethene‐respiring anaerobe

Should be associated with the words isolation, characterization, dehalococcoides, etc…

Build a SQL query that find all articles that contains a list of words.
Implement a search engine for your database in Python.