This project provides a simple tool for training and testing a lightweight word-prediction model based on Word2Vec word-frequency analysis.
It is used by BBOT within the ffuf_shortnames module.
Sample models are included in the `trained_models` folder. These were trained on data harvested from the Common Crawl project.
- Train a Model: Train a Word2Vec model on a text corpus and extract its word-frequency data into a lightweight format.
- Predict Words: Retrieve likely word completions for a given prefix from a pre-trained model, ranked by frequency.
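The core idea behind both features can be sketched in a few lines: count how often each word appears in a corpus, then answer a prefix query with the most frequent matches. This is a minimal illustration only; the function names and the in-memory table are hypothetical, not the tool's actual `.pred` format.

```python
from collections import Counter

def build_frequency_table(words):
    # Count how often each word appears in the corpus.
    return Counter(words)

def predict(freq_table, prefix, n=5):
    # Return up to n words that start with `prefix`, most frequent first.
    matches = Counter({w: c for w, c in freq_table.items() if w.startswith(prefix)})
    return [w for w, _ in matches.most_common(n)]

corpus = ["prefix", "predict", "prefix", "apple", "press", "prefix"]
table = build_frequency_table(corpus)
print(predict(table, "pre", n=2))  # "prefix" (seen 3 times) ranks first
```

Ranking by raw frequency is what makes the format lightweight: only word counts need to be stored, not full embedding vectors.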
This project uses Poetry for dependency management.
- Python 3.9 or higher
- Poetry (install with `pip install poetry` if not already installed)
- Clone the repository:

  ```shell
  git clone https://github.com/yourusername/word-predictor.git
  cd word-predictor
  ```
- Install dependencies:

  ```shell
  poetry install
  ```
- Activate the virtual environment:

  ```shell
  poetry shell
  ```

  Or run commands with `poetry run`:

  ```shell
  poetry run python3 wordpredictor.py
  ```
The tool supports two modes: `train` and `test`.
Train a Word2Vec-based word predictor on a custom text file.
```shell
poetry run word-predictor train <file_path> [--min_count <value>] [--debug]
```

- `<file_path>`: Path to the input text file containing words (one per line).
- `--min_count`: Minimum frequency for a word to be included in the vocabulary (default: 2).
- `--debug`: Enable debug mode to print tokens during training.

```shell
poetry run word-predictor train words.txt --min_count 5
```
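The effect of `--min_count` is to drop rare words from the vocabulary before the model is built, which keeps the output small and filters out noise. A minimal sketch of that filtering step, assuming simple pre-tokenized input (the function name is hypothetical):

```python
from collections import Counter

def build_vocab(tokens, min_count=2):
    # Keep only words that occur at least `min_count` times.
    counts = Counter(tokens)
    return {w: c for w, c in counts.items() if c >= min_count}

tokens = ["admin", "admin", "admin", "login", "login", "backup"]
print(build_vocab(tokens, min_count=2))  # "backup" (seen once) is dropped
```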
Test the predictor by retrieving predictions for a given prefix.
```shell
poetry run word-predictor test <model_path> --prefix <prefix> --n <top_n>
```

- `<model_path>`: Path to the trained `.pred` file.
- `--prefix`: Prefix to predict words for.
- `--n`: Number of top predictions to retrieve.

```shell
poetry run word-predictor test words.pred --prefix "pre" --n 5
```
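The `.pred` file's internal layout is not documented here, but as a design illustration: if a model stores its vocabulary as a sorted list, a prefix query can be answered with binary search instead of scanning every word. A sketch under that assumption (the function name is hypothetical):

```python
import bisect

def prefix_range(sorted_words, prefix):
    # Binary-search the slice of words sharing `prefix`; "\uffff" acts
    # as a sentinel that sorts after any continuation of the prefix.
    lo = bisect.bisect_left(sorted_words, prefix)
    hi = bisect.bisect_right(sorted_words, prefix + "\uffff")
    return sorted_words[lo:hi]

words = sorted(["predict", "prefix", "press", "apple", "print"])
print(prefix_range(words, "pre"))  # ['predict', 'prefix', 'press']
```

This makes each lookup O(log n) in the vocabulary size, which matters when the model is queried repeatedly, as in fuzzing workflows.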