Project Link: https://github.com/haard7/IR-Project-A20540508
Description:
- There are three components in this project, which together demonstrate end-to-end search engine functionality at a small scale, to illustrate how web search engines actually work.
1) Crawler

- It is a Scrapy-based crawler which downloads HTML documents starting from the given seed URLs, bounded by the `max_depth` and `max_pages` parameters.
- In this project I have used Wikipedia URLs, mostly related to renewable energy and power. The crawler downloads the HTML documents into the `crawler/Data` directory; all of those documents are used by the next component to build the inverted index. Each document is saved under the last path segment of its URL to keep track of the documents, as in the sketch after this list.
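Below is a minimal sketch of what such a spider can look like, assuming Scrapy's built-in `DEPTH_LIMIT` and `CLOSESPIDER_PAGECOUNT` settings stand in for `max_depth` and `max_pages`; the seed URL, file name, and `Data/` output path are illustrative, not the project's exact code.

```python
# Sketch: a Scrapy spider that saves each page under the last URL path segment.
import os
from urllib.parse import urlparse

import scrapy


class WikiSpider(scrapy.Spider):
    name = "wiki"
    # Hypothetical seed; the project crawls renewable-energy Wikipedia pages.
    start_urls = ["https://en.wikipedia.org/wiki/Renewable_energy"]
    custom_settings = {
        "DEPTH_LIMIT": 2,             # plays the role of max_depth
        "CLOSESPIDER_PAGECOUNT": 50,  # plays the role of max_pages
    }

    def parse(self, response):
        # e.g. .../wiki/Solar_power -> Data/Solar_power.html
        os.makedirs("Data", exist_ok=True)
        last_segment = urlparse(response.url).path.rstrip("/").split("/")[-1]
        with open(os.path.join("Data", f"{last_segment}.html"), "wb") as f:
            f.write(response.body)

        # Follow in-article links; Scrapy enforces DEPTH_LIMIT on its own.
        for href in response.css("a::attr(href)").getall():
            if href.startswith("/wiki/"):
                yield response.follow(href, callback=self.parse)
```

Running it with `scrapy runspider wiki_spider.py` would fill `Data/` with one HTML file per crawled page.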
2) Indexer
- It is a scikit-learn-based indexer which creates the inverted index by parsing the HTML documents produced by the crawler.
- It uses TF-IDF scores to weight the terms in the inverted index and cosine similarity to rank documents against a query.
- This component generates two files: `inverted_index.json`, which contains the postings corresponding to each term, and `content.json`, which stores each document ID with its corresponding document name and content. The content is also printed by the Flask-based processor, to help verify that the search results are working well.
- You can also test the indexer locally by running `python indexer.py` and modifying `config.json` to see the list of top-k documents printed to the console. A sketch of the indexing and ranking steps follows this list.
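The sketch below shows how these pieces can fit together with scikit-learn: TF-IDF weights become the postings in `inverted_index.json`, `content.json` keeps the ID/name/content mapping, and cosine similarity ranks documents for a query. The paths, the JSON layout, the use of BeautifulSoup for HTML parsing, and the content truncation are assumptions based on the description above, not the project's exact format.

```python
# Sketch: build a TF-IDF inverted index from crawled HTML and rank by cosine similarity.
import json
import os

from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Parse the crawled HTML documents into plain text.
docs, names = [], []
for fname in sorted(os.listdir("crawler/Data")):
    with open(os.path.join("crawler/Data", fname), encoding="utf-8") as f:
        docs.append(BeautifulSoup(f.read(), "html.parser").get_text(" ", strip=True))
        names.append(fname)

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)  # shape: (n_docs, n_terms)

# Inverted index: term -> [(doc_id, tfidf_weight), ...]
inverted_index = {}
terms = vectorizer.get_feature_names_out()
coo = tfidf.tocoo()
for doc_id, term_id, weight in zip(coo.row, coo.col, coo.data):
    inverted_index.setdefault(terms[term_id], []).append((int(doc_id), float(weight)))

with open("inverted_index.json", "w") as f:
    json.dump(inverted_index, f)
with open("content.json", "w") as f:
    json.dump([{"doc_id": i, "document_name": n, "content": d[:500]}
               for i, (n, d) in enumerate(zip(names, docs))], f)

# Rank documents for a query by cosine similarity of TF-IDF vectors.
query_vec = vectorizer.transform(["solar power generation"])
scores = cosine_similarity(query_vec, tfidf).ravel()
for doc_id in scores.argsort()[::-1][:5]:
    print(names[doc_id], round(float(scores[doc_id]), 4))
```

Storing the TF-IDF weight directly in each posting means the processor can score documents later without re-vectorizing the corpus.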
3) Processor
- It is a `Flask`-based processor that prints the top-k results after performing query validation/error checking and spelling correction. I have used `NLTK` for stopword removal and `FuzzyWuzzy` for spelling correction.
- My Flask app shows the top-k results for searched queries in the UI. It also returns the top-k results as JSON documents, including the document name, document ID, and content. A sketch of the query-processing steps follows this list.
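Below is a minimal sketch of the processor's query path, assuming a `/search` POST route, the `inverted_index.json` layout from the indexer sketch above, and a summed-TF-IDF scoring shortcut in place of full cosine similarity; the 80-point FuzzyWuzzy threshold and the route/field names are illustrative, not the project's exact logic.

```python
# Sketch: Flask processor with NLTK stopword removal and FuzzyWuzzy correction.
import json

from flask import Flask, jsonify, request
from fuzzywuzzy import process as fuzzy_process
from nltk.corpus import stopwords  # requires nltk.download("stopwords") once

app = Flask(__name__)
STOPWORDS = set(stopwords.words("english"))

with open("inverted_index.json") as f:
    inverted_index = json.load(f)
vocabulary = list(inverted_index.keys())


@app.route("/search", methods=["POST"])
def search():
    query = (request.get_json() or {}).get("query", "").strip()
    if not query:  # query validation / error checking
        return jsonify({"error": "empty query"}), 400

    corrected = []
    for token in query.lower().split():
        if token in STOPWORDS:
            continue  # stopword removal with NLTK
        # Spelling correction: snap each term to its closest vocabulary match.
        match, score = fuzzy_process.extractOne(token, vocabulary)
        corrected.append(match if score >= 80 else token)

    # Score documents by summing the TF-IDF weights from each corrected
    # term's postings (a simplification of full cosine-similarity ranking).
    scores = {}
    for term in corrected:
        for doc_id, weight in inverted_index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight
    top_k = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:5]

    return jsonify({
        "corrected_query": " ".join(corrected),
        "results": [{"doc_id": d, "score": round(s, 4)} for d, s in top_k],
    })


if __name__ == "__main__":
    app.run(debug=True)
```

A misspelled query can then be tried with, e.g., `curl -X POST -H 'Content-Type: application/json' -d '{"query": "solr enrgy"}' http://127.0.0.1:5000/search`, which returns the corrected query alongside the ranked top-k results.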