Welcome to condor-ir’s documentation!¶
condor-ir is a suite of tools for working with latent semantic search engines, as well as other search engines. It currently offers index database management, computation of LSA ranking matrices, and query tools.
Contents:
Quickstart for users¶
Setup¶
In order to use condor-ir you need to set up and configure a MongoDB database and have it running on the default port, because the application's database connection is not yet configurable.
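Before going further it can help to confirm that something is listening on MongoDB's default port, 27017. The following is a minimal sketch that uses only the Python standard library. It assumes the daemon runs on localhost, and the function name is made up for this example; it is not part of condor-ir.

```python
import socket

# 27017 is MongoDB's default port; the host is an assumption (local daemon).
MONGO_HOST = "localhost"
MONGO_PORT = 27017


def mongo_port_is_open(host=MONGO_HOST, port=MONGO_PORT, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


if __name__ == "__main__":
    state = "reachable" if mongo_port_is_open() else "not reachable"
    print(f"MongoDB default port {MONGO_PORT} is {state}")
```

An open port only shows that some service answers there; it does not prove that the service is MongoDB or that it is configured correctly.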
You can install condor-ir from PyPI using the pip tool:
$ pip install condor-ir
Furthermore, for language support you need to install some dictionaries and a spell checking engine:
# Arch
sudo pacman -S enchant aspell-es aspell-en aspell-fr aspell-it aspell-pt
# Ubuntu
sudo apt-get install enchant \
aspell-es aspell-en aspell-fr aspell-it aspell-pt
Once the MongoDB daemon is running on your system, you can start building your models from any of the three supported formats:
- Froac
- ISI plain text
- BibTeX
Repository preparation¶
Once you’re done with the setup, you should start preparing the documents that will feed the index database. You can arrange your sources any way you want in your filesystem, because the condor-ir CLI uses a glob matching interface to find the documents you want. One example is the repository provided along with this package:
$ tree ../data
../data
+── bib
│ +── oaa.bib
+── froac
│ +── froac1
│ │ +── 30Algebra relacional. Operación Composición.xml
│ │ +── 71Estandares.xml
│ │ +── 7Programacion lineal.xml
│ │ +── 83Modelo - Vista - Controlador.xml
│ │ +── 85Video objeto de aprendizaje CrazyTalk.xml
│ │ +── 86Aprendiendo con Cuadernia.xml
│ │ +── 87Introduccion a eXe Learning.xml
│ │ +── 89OA 1 Clase UTP.xml
│ │ +── 92Prueba parcial 1.xml
│ │ +── 97Respuesta libre en circuitos de primer orden.xml
│ │ +── ...
│ +── roapManizales1.xml
│ +── roapManizales2.xml
│ +── roapManizales3.xml
│ +── roapManizales4.xml
│ +── roapManizales5.xml
+── isi
    +── isi.txt
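To get a feel for which files a glob pattern selects in a layout like the one above, here is a small sketch using Python's standard glob module. It recreates a slice of the example tree in a temporary directory. Whether the condor-ir CLI expands patterns exactly like the glob module (for instance, whether it supports `**`) is an assumption that this sketch does not verify.

```python
import glob
import os
import tempfile

# Recreate a tiny slice of the example layout in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    for rel in ("data/bib/oaa.bib",
                "data/froac/roapManizales1.xml",
                "data/froac/froac1/71Estandares.xml",
                "data/isi/isi.txt"):
        path = os.path.join(root, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()

    def match(pattern):
        """Return matches relative to root, sorted for stable output."""
        hits = glob.glob(os.path.join(root, pattern), recursive=True)
        return sorted(os.path.relpath(h, root).replace(os.sep, "/")
                      for h in hits)

    # '*' matches exactly one directory level.
    one_level = match("data/*/*.xml")
    # '**' (with recursive=True) matches any number of levels.
    any_depth = match("data/**/*.xml")

print(one_level)
print(any_depth)
```

With the standard module, a single `*` does not descend into froac1, so a pattern such as 'data/*/*.xml' only picks up the XML files that sit directly inside each first-level folder.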
Populate the database¶
Once you have your dataset organized you can populate your database using the condorpopulate program.
condorpopulate --xml 'data/*/*.xml'
You can also stipulate a database name and specify whether you want to wipe the database. If you want to combine records from different kinds of databases, you can do so by rerunning the condorpopulate tool with the --no-wipedb flag:
condorpopulate --isi --no-wipedb 'data/*/*.txt'
condorpopulate --bib --no-wipedb 'data/*/*.bib'
However, when using this approach, beware of record duplication, as the hashing algorithms used to detect duplicates differ between the different kinds of record files.
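The duplication risk can be illustrated with a short sketch. The fingerprint functions, the fields they hash, and the sample record are all hypothetical. They are not condor-ir's actual hashing code; they only show why two format-specific hashes cannot recognize the same document.

```python
import hashlib


def digest(*fields):
    """SHA-1 over the given fields, lowercased and whitespace-normalized."""
    joined = "|".join(" ".join(f.lower().split()) for f in fields)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


# Hypothetical per-format fingerprints: each format hashes different fields.
def bib_fingerprint(record):
    return digest(record["title"], record["author"])


def isi_fingerprint(record):
    return digest(record["title"], record["year"])


# Made-up record standing for one paper exported in two formats.
paper = {"title": "Programacion lineal", "author": "A. Author", "year": "2015"}

seen = {bib_fingerprint(paper)}               # paper loaded from a .bib file
is_duplicate = isi_fingerprint(paper) in seen  # same paper from an ISI file

print(is_duplicate)
```

Because the two fingerprints are computed from different fields, the second import does not find the first one, and the same paper ends up stored twice.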
Build a model for the database¶
Once your database is populated you can build a model or ranking matrix for your database using the command:
condormodel
This program creates versioned models, so you can build different versions, or query with one model while another one is still being built.
Be aware that this is the most time consuming operation in the suite, as it involves inverting a matrix with a rank of several thousand.
Query the model¶
Once you have built the model you can start performing queries with:
condorquery search terms
This will run the query against the latest available model in the model database.
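The build and query steps above can be pictured with a textbook latent semantic analysis sketch. It uses NumPy and a truncated singular value decomposition, which is the usual LSA factorization. It is not taken from condor-ir and may differ from what condormodel and condorquery compute. The corpus, the variable names and the choice of k are assumptions made for illustration.

```python
import numpy as np

# Toy corpus; the documents are made up for illustration.
docs = [
    "linear programming optimization",
    "relational algebra composition operation",
    "model view controller pattern",
    "linear algebra and optimization",
]

vocab = sorted({word for doc in docs for word in doc.split()})
column = {word: i for i, word in enumerate(vocab)}


def vectorize(text):
    """Raw term-count vector over the corpus vocabulary."""
    vec = np.zeros(len(vocab))
    for word in text.split():
        if word in column:
            vec[column[word]] += 1.0
    return vec


# "Build the model": truncated SVD of the term-document matrix.
A = np.column_stack([vectorize(doc) for doc in docs])  # terms x documents
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # number of latent dimensions kept
Uk = U[:, :k]
doc_vecs = (np.diag(s[:k]) @ Vt[:k, :]).T  # one k-dim row per document


def rank(query):
    """Document indices ordered from most to least similar to the query."""
    q = Uk.T @ vectorize(query)  # fold the query into the latent space
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q)
    sims = doc_vecs @ q / (norms + 1e-12)  # cosine similarity
    return [int(i) for i in np.argsort(-sims)]


ranking = rank("linear optimization")
print(ranking)
```

In this toy corpus the model-view-controller document shares no vocabulary, direct or indirect, with the query, so it lands at the end of the ranking. The decomposition is the expensive part; answering a query afterwards only needs a few small matrix products.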
Next steps¶
- Learn more about latent semantic analysis
- Learn more about index databases
- Learn more about the different supported file formats
API Reference¶
The main part of the condor program is the set of parsers that convert the
different kinds of database documents into manageable dictionaries containing
only the metadata fields of interest. These are implemented in the record
module.
The record API¶
The main class of the record API is the RecordParser
class, which outlines an API that parses data out of a raw string or raw data
structure into a dictionary with the desired fields of interest. The details of
how to extract that information belong to the
RecordParser
child classes.
Furthermore, the RecordParser
is complemented by the
RecordIterator
class, which outlines an interface to
iterate over a file containing several records, returning (yielding) all the
records in a memory-efficient fashion.
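The parser and iterator split described above can be sketched as follows. The class names, the method names and the toy "key: value" format are assumptions made for this example. They are not the actual RecordParser or RecordIterator signatures, which the record module defines.

```python
from abc import ABC, abstractmethod
import io


class SketchParser(ABC):
    """Turns one raw record into a dict holding only the fields of interest."""

    @abstractmethod
    def parse(self, raw):
        ...


class KeyValueParser(SketchParser):
    """Child class for a toy 'key: value' format (not a condor-ir format)."""

    fields = ("title", "author")

    def parse(self, raw):
        record = {}
        for line in raw.splitlines():
            key, sep, value = line.partition(":")
            if sep and key.strip().lower() in self.fields:
                record[key.strip().lower()] = value.strip()
        return record


class BlankLineIterator:
    """Yields parsed records from a file where blank lines separate records."""

    def __init__(self, handle, parser):
        self.handle = handle
        self.parser = parser

    def __iter__(self):
        chunk = []
        for line in self.handle:  # reads lazily, one line at a time
            if line.strip():
                chunk.append(line)
            elif chunk:
                yield self.parser.parse("".join(chunk))
                chunk = []
        if chunk:  # the last record has no trailing blank line
            yield self.parser.parse("".join(chunk))


# Made-up input: two records, the first with a field we do not keep.
source = io.StringIO(
    "Title: Estandares\nAuthor: A. Author\nPages: 3\n\n"
    "Title: Programacion lineal\n"
)
records = list(BlankLineIterator(source, KeyValueParser()))
print(records)
```

Because the iterator is a generator, only the lines of the record currently being assembled are held in memory, which is what makes this pattern suitable for large files.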
The utility module¶
The scripts package¶
Database manipulation¶
Script entry points¶
The entry points are organized in modules. This leads to some code duplication,
which can be reduced in the future. The populate script, which provides the
condorpopulate command, is located in the populate
module, and contains the information described below.
The model script, which provides the condormodel command, is located in the
model
module, and contains the information described
below.
The query script, which provides the condorquery command, is located in the
query
module, and contains the information described
below.
Developer quickstart¶
condor-ir is free and open source software. Contributions are very welcome, and you can start contributing through the common GitHub workflow.
Setting up the development environment¶
First you need to clone the repository:
git clone https://github.com/odarbelaeze/condor-ir.git
# Alternatively you can fork and clone your own fork
git clone https://github.com/<username>/condor-ir.git
The recommended way to do development is in a virtual environment. We will assume that you are familiar with virtualenvwrapper:
cd condor-ir
mkvirtualenv -a $PWD -p $(which python3) -r piprequirements.txt condorenv
pip install -e . # This command will also install dependencies
Running tests¶
When contributing to an open source project, it’s crucial that you are able to run its test suite. This project suggests pytest as its testing framework, and you can use its test runner to run our tests. To run them you first need to install the testing dependencies, which you can do automatically using the setup.py script:
python setup.py test
Otherwise, you can install the testing dependencies and run the tests manually:
pip install pytest
python -m pytest
Furthermore, you can install pytest plugins, such as pytest-cov for coverage reporting, and run the test suite using their custom flags.
Contributing to the code¶
What you can contribute¶
You can contribute to the code in more ways than adding new features and hammering away at the keyboard on the core of the program. Some examples of contributions you can make are the following:
- Write documentation
- Write tests, maybe while you are trying to understand what the code does
- Refactor code, since there are optimizations that can be made here and there
Otherwise, you can expand the functionality of the record parsers by adding support for more document kinds.
How to contribute¶
This project supports the feature branch and pull request (PR) way of contributing through git. You start by adding a new branch to your repo:
# While in the repo folder
git checkout -b <a-name-that-outlines-your-contribution>
# Do some ground work
git push -u <origin> <a-name-that-outlines-your-contribution>
Then you proceed to create the pull request on GitHub right away, so you can start communicating with the core developers and other members of the community.
Recommendations and code guidelines¶
Honoring the code guidelines is a key part of contributing to a project, because git treats indentation fixes and line-length corrections as rewrites, which can dilute the authorship of the contributed code. This project includes an .editorconfig file that modern text editors can use to automatically apply the coding guidelines for each file type.
Although the code coverage of this project might be low at some point, contributors are encouraged to write tests for their features, as well as regression tests for the code that is already there. Tests are written using pytest, which is a very easy to use testing framework. To get started, you just need to drop a function whose name starts with test_ into a module within the tests folder, and the suite will automatically pick up your testing code.
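As a minimal sketch of what such a test module can look like: the file name and the helper under test are hypothetical and only stand in for whatever code you are covering. With pytest's default discovery rules the module name itself should also start with test_ (or end with _test).

```python
# tests/test_titles.py  (hypothetical module inside the tests folder)


def normalize_title(title):
    """Hypothetical helper under test: collapse whitespace and lowercase."""
    return " ".join(title.split()).lower()


def test_normalize_title_collapses_whitespace():
    assert normalize_title("  Programacion   Lineal ") == "programacion lineal"


def test_normalize_title_keeps_empty_string():
    assert normalize_title("") == ""
```

In a real contribution the helper would be imported from the package instead of being defined next to the tests; it is inlined here only to keep the example self-contained.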