IR - 11 - Lab I

Lecture Info

Data: [2019-12-06 ven]
Sito corso: link
Introduzione: In questa prima lezione di laboratorio abbiamo visto del codice python per la costruzione di due indici inversi: uno non posizionale e l'altro posizionale.

1 Unigram Inverted Index

In questo esempio abbiamo indicizzato un insieme di documenti utilizzando un indice inverso non posizionale per poi implementare un una nostra query language. Il codice è stato in parte preso dalla seguente github repo, ed è disponibile sia come jupyter notebook, e sia come normale file python.

1.1 Imports

Le librerie utilizzate sono le seguenti

 from nltk.corpus import stopwords
 from nltk.tokenize import word_tokenize
 from nltk.stem import PorterStemmer

 import nltk
 import os
 import string
 import numpy as np
 import copy
 import pandas as pd
 import pickle

1.2 Data Extraction

Il dataset utilizzato è il 20 NewsGroup, che rappresenta una collezione di quasi $20.000$ articoli di news partizionati in modo uniforme rispetto a $20$ diverse categorie. Per scaricarlo è possibile eseguire il seguente codice bash

curl http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz -o 20_newsgroups
tar -xvf 20_newsgroups

Il codice python che segue deve essere eseguito all'interno della stessa cartella in cui è stato scaricato il dataset. Si assume inoltre che la cartella contenente il dataset è chiamata 20_newsgroups . Tale codice non fa altro che inserire in una lista le paths di tutti i documenti da indicizzare.

title = "20_newsgroups"
paths = []
for (dirpath, dirnames, filenames) in os.walk(str(os.getcwd()) + '/' + title + '/'):
    for i in filenames:
        paths.append(str(dirpath)+str("/") + i)

Una volta eseguito nella lista paths abbiamo quindi tutte le paths salvate nella lista paths .

print(paths[0]) 
# home/leo/repos/ir/private/code/lab_1/20_newsgroups/comp.graphics/38704
print(paths[1]) 
# /home/leo/repos/ir/private/code/lab_1/20_newsgroups/comp.graphics/38352

1.3 Pre-Processing

Andiamo adesso a definire la catena di preprocessamento che vogliamo applicare al contenuto di ciascun documento prima di andarlo ad indicizzare.

def preprocess(data, query):
    if not query:
        data = remove_header(data)
        data = convert_lower_case(data)
        data = convert_numbers(data)
        data = remove_punctuation(data) # remove comma seperately
        data = remove_stop_words(data)
        data = remove_apostrophe(data)
        data = remove_single_characters(data)
        data = stemming(data)
    return data

Notiamo come l'ordine in cui effettuiamo questi steps è importante. Segue quindi l'implementazione di ciascun modulo.

1.3.1 Remove Header

Ogni articolo presente nel dataset è formato da una serie di metadati iniziali. Per separare i metadati dal contenuto effettivo del testo sono utilizzati due newline characters \n .

Ad esempio, l'articolo 20_newsgroups/comp.graphics/37261 ha il seguente formato

La rimozione dell'header viene quindi effettuata con la seguente funzione.

def remove_header(data):
    try:
        ind = data.index('\n\n')
        data = data[ind:]
    except:
        print("No Header")
    return data

1.3.2 Convert Lower Case

def convert_lower_case(data):
    return np.char.lower(data)

1.3.3 Convert Numbers

def convert_numbers(data):
    data = np.char.replace(data, "0", " zero ")
    data = np.char.replace(data, "1", " one ")
    data = np.char.replace(data, "2", " two ")
    data = np.char.replace(data, "3", " three ")
    data = np.char.replace(data, "4", " four ")
    data = np.char.replace(data, "5", " five ")
    data = np.char.replace(data, "6", " six ")
    data = np.char.replace(data, "7", " seven ")
    data = np.char.replace(data, "8", " eight ")
    data = np.char.replace(data, "9", " nine ")
    return data

1.3.4 Remove Punctuaction

def remove_punctuation(data):
    symbols = "!\"#$%&()*+-./:;<=>?@[\]^_`{|}~\n"
    for i in range(len(symbols)):
        data = np.char.replace(data, symbols[i], ' ')
    data = np.char.replace(data, "  ", " ")
    data = np.char.replace(data, ',', '')
    return data

1.3.5 Remove Stop Words

La lista di stop words utilizzata ci è fornita dalla libreria nltk . Per poterla utilizzare è necessario eseguire il seguente codice python, che scaricherà le stopwords nella cartella $HOME/nltk_data/corpus/ .

import nltk
nltk.download('downloads')

Per fare un esempio alcune delle stopwords presenti nella lista inglese sono le seguenti

stop_words = stopwords.words('english')
print(stop_words[:10])
# ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

Il codice che processa il testo per rimuovere le stop words è quindi il seguente

def remove_stop_words(data):
    stop_words = stopwords.words('english')
    words = word_tokenize(str(data))
    new_text = ""
    for w in words:
        if w not in stop_words:
            new_text = new_text + " " + w
    return np.char.strip(new_text)

1.3.6 Remove Apostrophe

def remove_apostrophe(data):
    return np.char.replace(data, "'", "")

1.3.7 Remove Single Characters

Qui utilizziamo la funzione word_tokenize() che ci viene offerta dalla libreria nltk . Maggiori informazioni rispetto alla relativa implementazione possono essere trovate qui.

def remove_single_characters(data):
    words = word_tokenize(str(data))
    new_text = ""
    for w in words:
        if len(w) > 1:
            new_text = new_text + " " + w
    return np.char.strip(new_text)

1.3.8 Stemming

Nuovamente, il PorterStemmer() ci viene offerto dalla libreria nltk . Maggiori informazioni possono essere trovate qui. Per far funzionare lo stemming dobbiamo eseguire il seguente codice python.

import nltk
nltk.download('punkt')

def stemming(data):
    stemmer= PorterStemmer()
    tokens = word_tokenize(str(data))
    new_text = ""
    for w in tokens:
        new_text = new_text + " " + stemmer.stem(w)
    return np.char.strip(new_text)

1.4 Index Construction

Per quanto riguarda la costruzione dell'indice, l'idea è quella di utilizzare i data frame offerti dalla libreria pandas . Per meggiori informazioni è possibile utilizzare il seguente link Pandas DataFrame

Il codice che si occupa della creazione della struttura dati principale è il seguente

doc = 0
postings = pd.DataFrame()

for path in paths:
    # open file, get text, close  file
    file = open(path, 'r', encoding='cp1250')
    text = file.read().strip()
    file.close()
    
    # preprocess text
    preprocessed_text = preprocess(text, False)

    # Index each token in the document in our data structure.
    tokens = word_tokenize(str(preprocessed_text))
    for token in tokens: 
        if token in postings:
            p = postings[token][0] # get posting list associated with token 
            p.add(doc)             # add docID to posting list
            postings[token][0] = p # update posting list associated with token
        else:
            # create new posting list for token with a single docID
            postings.insert(value=[{doc}], loc=0, column=token)

    # increment docID variable
    doc += 1

La struttura creata è quella descritta in una delle prime lezioni. Per analizzarla meglio possiamo eseguire i seguenti comandi python

print(postings.info(verbose=True))
#    class pandas.core.frame.DataFrame
#    RangeIndex: 1 entries, 0 to 0
#    Data columns (total 485 columns):
#    #   Column           Dtype 
#    ---  ------           ----- 
#    0   license          object
#    1   privat           object
#    2   other            object
#    3   news             object
#    4   keep             object
#    5   dont             object
#    6   apparantli       object
#    7   success          object
#    8   determin         object
#    9   sell             object
#    10  softwar          object
#    11  concept          object
#    ...

Ogni riga della tabella è poi formata da una lista di docIDs nel seguente modo

print(postings['zero'])
# 0    {0, 2, 3, 6, 7, 8, 9}
# Name: zero, dtype: object

Per salvare sul disco la struttura dati in modo da mantenere un certo livello di persistenza possiamo utilizzare la libreria python pickle nel seguente modo

#  write to file
postings.to_pickle(title + "_unigram_postings")
# read from file
postings = pd.read_pickle(title + "_unigram_postings")

Infine, per interagire con la struttura dati definiamo le seguenti due funzioni:

La funzione get_posting(word) non fa altro che ritornare la posting list associata alla parola word :

def get_posting(word):
    if word in postings:
        return postings[word][0]
    else:
        return []

La funzione print_word_postings(word) stampa la posting list associata alla parola word :

def print_word_postings(word):
    preprocessed_word = str(preprocess(word, True))
    print(preprocessed_word)
    postings = get_posting(preprocessed_word)
    print("Document Frequency:",len(postings))
    print("Postings List:",postings)

1.5 Query Language

Andiamo adesso ad implementare la nostra query language. Le query che vogliamo eseguire sono semplici query booleane. Segue qualche esempio

$\text{place}$
$\text{authority}$
$\text{place AND NOT authority}$
$\text{place OR authority}$
$\text{place OR authority AND NOT welcome}$

La funzione principale che si occupa dell'esecuzione delle nostre query è la seguente.

def execute_query(query):
    command_tokens, query_tokens = generate_query_tokens(query)
    tup = generate_not_tuple(query_tokens, command_tokens)
    
    print("\nCommands:", command_tokens)
    print("\nQuery Words:", query_tokens)
    print("\nNot Tup:", len(tup))
    
    final_set = binary_operations(query_tokens, command_tokens, tup)
    final_set = sorted(final_set)
    
    print("\nFinal Set:",final_set)
    print("-----------------------------------------")
    return final_set

ed è così descritta:

La chiamata alla funzione generate_query_tokens() serve per leggere la query e spezzarla in due sequenze di tokens: i command_tokens , che sono i token relativi ai comandi tipo $\text{AND, OR, NOT}$, e i query_tokens , che sono i termini presenti nella query tipo $\text{place, authority}$. Tale funzione è implementata nel seguente modo

def generate_query_tokens(query):
    query = query.lower()
    tokens = word_tokenize(query)
    
    command_tokens = []
    query_tokens = []
    
    for t in tokens:
        if t not in ['and', 'or', 'not']:
            processed_token = preprocess([t], True)
            query_tokens.append(str(processed_token))
        else:
            command_tokens.append(t)
            
    return command_tokens, query_tokens

La funzione get_not_tuple() invece esegue tutti i comandi della forma $\text{NOT} \langle \text{token} \rangle$ e ritorna in un insieme i risultati dei comandi eseguiti indicizzati in base alla posizione del comando rispetto a tutti i comandi. Tale funzione è implementata nel seguente modo

def generate_not_tuple(query_tokens, command_tokens):
    tup = set()
    # as long as there are NOT command to execute
    while 'not' in commands:
        i = command_tokens.index('not')
        word = query_tokens[i]
        word_postings = get_not(word) # get result of NOT <word> 
        tup.update(word_postings)
        command_tokens.pop(i) # remove command
        # Replace the word with an ID to signal that it has been
        # processed 
        # 
        # NOTE: numbers have been already removed in the preprocessing
        # stage.
        query_words[i] = i
        print("\nAfter Not Processing:",commands, query_words)
    return tup

Tale funzione esegue a sua volta la funzione get_not(word), che non fa altro che calcolare la posting list contenente il $\text{docID}$ di tutti i documenti che non contengono la parola $\text{word}$. Tale funzione è implementata come segue

def get_not(word):
    a = get_posting(word)
    b = set(range(len(paths)))
    return b.difference(a)

dove paths era la lista che ci eravamo calcolati inizialmente e che conteneva le paths di tutti i documenti che abbiamo indicizzato.

Infine, la funzione binary_operations() viene utilizzata per eseguire i restanti comandi e calcolare il risultato finale. Tale funzione è implementata come segue

def binary_operations(query_tokens, command_tokens, tup):
    if not query_tokens[0]:
        return tup

    a = get_posting(query_tokens[0])
    query_tokens.pop(0)

    for i in range(len(command_tokens)):
        if type(query_tokens[i])==int:
            b = tup
        else:
            b = get_posting(query_tokens[i])

        if command_tokens[i] == 'and':
            a = a.intersection(b)
        elif command_tokens[i] == 'or':
            a = a.union(b)
        else:
            print("Invalid Command")

    return a

1.5.1 Examples

Andando ad eseguire alcune query esempio otteniamo i seguenti risultati. Notiamo che le query non sono state eseguite su tutto il dataset, ma solo su sui primi 50 documenti della sottocategoria comp.graphics .

execute_query("place")
# -----------------------------------------
# Command Tokens: []
#
# Query Tokens: ['place']
#
# Not Tup: 0
#
# Final Set: [40]
# -----------------------------------------

execute_query("authority")
# -----------------------------------------
#
# Command Tokens: []
#
# Query Tokens: ['author']
#
# Not Tup: 0
#
# Final Set: [1, 15, 34]
# -----------------------------------------

execute_query("welcome")
# -----------------------------------------
#
# Command Tokens: []
#
# Query Tokens: ['welcom']
#
# Not Tup: 0
#
# Final Set: [20, 42]
# -----------------------------------------

execute_query("place or authority")
# -----------------------------------------
#
# Command Tokens: ['or']
#
# Query Tokens: ['place', 'author']
#
# Not Tup: 0
#
# Final Set: [1, 15, 34, 40]
# -----------------------------------------

execute_query("place and not authority")
# -----------------------------------------
#
# After Not Processing: ['and'] ['place', 1]
#
# Command Tokens: ['and']
#
# Query Tokens: ['place', 1]
#
# Not Tup: 997
#
# Final Set: [40]
# -----------------------------------------

execute_query("place or authority and not welcome")
# -----------------------------------------
#
# After Not Processing: ['or', 'and'] ['place', 'author', 2]
#
# Command Tokens: ['or', 'and']
#
# Query Tokens: ['place', 'author', 2]
#
# Not Tup: 998
#
# Final Set: [1, 15, 34, 40]
# -----------------------------------------

1.6 Exercises

Seguono alcuni possibili esercizi che consistono nel modificare il codice appena descritto:

Cambiare l'implementazione del codice che costruire le varie posting lists in modo tale da avere delle posting lists ordinate rispetto ai vari $\text{docIDs}$.

Ottimizzare l'implementazione dell'operazione $\text{NOT}$.

Implementare un concetto di priorità tra le operazioni utilizzando le parentesi.

2 Positional Index

In quest'altro esempio abbiamo invece indicizzato un insieme di documenti utilizzando un indice posizionale, in modo da poter rispondere a query formate da più parole della forma $\text{Tor Vergata}$. Nuovamente, Il codice è stato in parte preso dalla seguente github repo, ed è disponibile sia come jupyter notebook e sia come normale file python.

Notiamo che il codice utilizzato per questo esempio è molto simile a quello utilizzato per il precedente. Al posto di riportare tutto quindi riportiamo solo le differenze. Ad esempio gli steps di importazione, data extraction e pre-processing sono essenzialmente gli stessi. Le differenze iniziano a partire dalla costruzione dell'indice.

2.1 Positional Index Construction

La costruzione dell'indice questa volta viene nel seguente modo

postings = pd.DataFrame()
frequency = pd.DataFrame()
doc = 0

for path in paths:
    file = open(path, 'r', encoding='cp1250')
    text = file.read().strip()
    file.close()
    preprocessed_text = preprocess(text, False)
    tokens = word_tokenize(str(preprocessed_text))

    pos = 0
    for token in tokens:
        if token in postings:
            p = postings[token][0]            
            k = [a[0] for a in p]  # get all docIDs
            if doc in k:
                for a in p:
                    if a[0] == doc:
                        a[1].add(pos)
                else:
                    p.append([doc,{pos}])
                    frequency[token][0] += 1
        else:
            postings.insert(value=[[[doc, {pos}]]], loc=0, column=token)
            frequency.insert(value=[1], loc=0, column=token)

        pos += 1
    doc += 1

Per quanto riguarda l'interazione con la struttura dati abbiamo le seguenti funzioni utili

def get_word_postings(word):
    preprocessed_word = str(preprocess(word, True))
    term_frequency = frequency[preprocessed_word][0]
    term_postings = postings[preprocessed_word][0]
    return term_frequency, term_postings

def get_posting(word):
    if word in postings:
        return postings[word][0]
    else:
        return []

def get_positions(posting_values, doc):
    for value in posting_values:
        if value[0] == doc:
            return posting_value[1]
    return {}

2.2 Query Language

In questo esempio siamo interessati ad implementare un linguaggio di query che ci permette di rispondere a query del tipo

$\text{Los Angeles}$
$\text{Tor Vergata}$
$\text{Mountain View}$

Per fare questo utilizziamo la seguente funzione

def run_query(query):
    processed_query = preprocess(query, True)
    query_tokens = word_tokenize(str(processed_query))
    
    if len(query_tokens) == 1:
        query_postings = get_posting(query_tokens[0])
        result = [a[0] for a in query_postings]
        return result
    
    init_word = query_tokens[0]
    init_matches = gen_init_set_matchings(init_word)
    query_tokens.pop(0)
    total_matched_docs = match_positional_index(init_matches, query_tokens)
    total_matched_docs = sorted(total_matched_docs)    
    return total_matched_docs

tale funziona controlla se la query ha un solo termine. In quel caso calcola tramite la funzione get_posting() quali sono i $\text{docIDs}$ che contengono tale termine e gli ritorna come risultato. Altrimenti, se la funzione ha più una parola, calcola tutti i $\text{docIDs}$ che contengono almeno il primo termine tramite la funzione gen_init_set_matchings() , che è implementata come segue

def gen_init_set_matchings(word):
    init = []
    word_postings = get_posting(word)
    for word_posting in word_postings:
        for positions in word_posting[1]:
            init.append((word_posting[0], positions))
    return init

tale funzione non fa altro che ritornare una lista di coppie di elementi della forma (docID, pos) .

Per finire viene poi chiamata la funzione match_positional_index , che utilizza la lista di coppie init_matches e i restanti query_tokens per calcolare quali sono i $\text{docIDs}$ che contengono esattamente tutti i token della query nel giusto ordine. Una possibile implementazione per tale funzione è la seguente

def match_positional_index(init, query_tokens):
    matched_docs = []
    for p in init:
        doc, pos = p[0], p[1]
        count = 0
        
        for query_token in query_tokens:
            pos = pos+1
            query_token_pos = get_posting(query_token)
            docs_list = [z[0] for z in query_token_pos]
            if doc in docs_list:
                doc_positions = get_positions(query_token_pos, doc)
                if pos in doc_positions:
                    count += 1
                else:
                    count += 1
                    break

            if count == len(query_tokens):
                matched_docs.append(p[0])
                
    return set(matched_docs)

2.2.1 Examples

A seguire riportiamo alcuni esempi di esecuzione (alcuni risultati sono stati tagliati). Come prima cosa andiamo ad analizzare le posting lists dei termini $\text{Mountaing}$ e $\text{View}$.

get_word_postings("Mountain")
# mountain
# Document Frequency: 12
# Postings List: [[52, {2044, 3092, 5006}],
#                 [179, {861, 669}],
#                 [302, {954}],
#                 [371, {403}],
#                 [428, {2174}],
#                 [430, {159}],
#                 [472, {388}],
#                 [573, {2174}],
#                 [634, {48}],
#                 [860, {4953, 3039, 1991}],
#                 [919, {839}],
#                 [968, {20}]]

get_word_postings("View")
# view
# Document Frequency: 68
# Postings List: [[0, {36}],
#               [2, {88}],
#               [51, {19}],
#               [52, {2032, 2705, 3093, 5007}],
#               [107, {330, 837, 214, 311}],
#               [113, {672, 258, 359, 625, 342, 283, 253}]]

Procediamo quindi con l'esecuzione di tre query. Le prime due sono relative ai singoli termini analizzati in precedenza, mentre la terza query trova tutti i documenti che contengono la combinazione $\text{Mountain View}$.

run_query("Mountain")
# mountain
# Query tokens: ['mountain']
# Total Document Mathces: [52, 179, 302, 371,
#                          428, 430, 472, 573,
#                          634, 860, 919, 968]

run_query("View")
# view
# Query tokens: ['view']
# Total Document Mathces: [0, 2, 51, 52, 107, 113, 123, 125, 156,
#                           167, 169, 179, 181, 208, 258, 263,
#                           268, 302, 311, 318, 330, 364, 370,
#                           384, 428, 430, 446, 453, 461, 462,
#                           465, 475, 478, 496, 520, 558, 571,
#                           573, 581, 585, 652, 684, 689, 690, 708,
#                           715, 734, 752, 755, 763, 768, 790, 805,
#                           811, 822, 848, 860, 887, 890, 891, 905,
#                           920, 927, 930, 958, 973, 975, 992]

run_query("Mountain View")
# mountain view
# Query tokens: ['mountain', 'view']
# Initial Matches: [(52, 2044), (52, 3092), (52, 5006),
#                   (179, 861), (179, 669), (302, 954),
#                   (371, 403), (428, 2174), (430, 159),
#                   (472, 388), (573, 2174), (634, 48),
#                   (860, 4953), (860, 3039), (860, 1991),
#                   (919, 839), (968, 20)]
# Total Document Matches: [52, 179, 302, 428, 430, 573, 860]

2.3 Exercises

Seguono alcuni possibili esercizi che consistono nel modificare il codice appena descritto:

Cambiare l'implementazione del codice che costruire le varie posting lists in modo tale da avere delle posting lists ordinate rispetto ai vari $\text{docIDs}$.
Ottimizzare l'implementazione dell'operazione merge che viene effettuata all'interno della funzione match_positional_index() .
Integrare gli indici posizinali con il linguaggio di query utilizzato nel precedente esempio.