Building Hybrid Search Engine : Combining Keyword and Semantic Search
In today’s digital age, the ability to efficiently retrieve relevant information from vast amounts of data is crucial. Traditional keyword search methods often fall short in understanding the context of user queries, leading to irrelevant results. This article explores the development of a hybrid search engine that combines keyword and semantic search techniques using MySQL, ChromaDB, Python, Streamlit, and Sentence Transformers.
Table of Contents
- Project Overview
- Project Structure
- Setting Up the Environment
- Configuring the Project
- Creating the Database Connection
- Generating Embeddings for Semantic Search
- Building the Streamlit Application
- Understanding the Search Functionality
- Running the Application
- Conclusion
Project Overview
The goal of this project is to create a hybrid search system that can process user queries through both keyword matching in a MySQL database and semantic understanding using embeddings stored in ChromaDB. The system leverages the power of the Sentence Transformers library to create embeddings for product descriptions, enhancing the search capabilities beyond simple keyword matching.
Problems Prior to This Hybrid Search Approach
- Limited Keyword Search:
- Traditional search engines only look for exact matches of keywords in a database. If the user doesn’t type the exact product name or words, they might not find what they’re looking for.
- Example: If a user searches for “cheap smartphones,” the system might miss products with descriptions like “affordable phones.”
- Context Understanding:
- Traditional searches struggle to understand the meaning behind the user’s query. They can’t recognize synonyms or related concepts.
- Example: Searching for “running shoes” may not show results for “sports sneakers” even though they mean the same thing.
- Inefficiency in Handling Descriptions:
- Keyword searches don’t handle long product descriptions well. They miss out on context, leading to irrelevant or incomplete results.
- Example: Searching for “comfortable office chair” might not bring up results with long descriptions that talk about ergonomic design and comfort.
What This Hybrid Search Solves
- Combines Keyword and Semantic Search:
- The hybrid system allows users to search using both traditional keyword matching (MySQL) and a more advanced semantic understanding (ChromaDB). This means users can find relevant products even if they don’t know the exact words used in the product description.
- Solution: Improves accuracy and relevance by considering both exact matches and the meaning of the words.
- Understands the Meaning of Queries:
- By using sentence embeddings (semantic search), the system understands the context and meaning behind a user’s query.
- Solution: Finds products based on the overall description, synonyms, and similar concepts, rather than just keywords.
- Handles Descriptions More Effectively:
- The system can search through and understand lengthy product descriptions, pulling out relevant results even if certain keywords aren’t present.
- Solution: Users get more accurate results for detailed or complex product descriptions.
Project Structure
The project structure is organized as follows:
hybrid_search
│
├── chroma_db
│
├── images
│ └── product_search.png
│
├── scripts
│ ├── __init__.py
| |── db.py
│ ├── create_embeddings.py
│ ├── query_embeddings.py
│
├── config.ini
├── main.py
├── products.sql
├── readme.md
└── requirements.txt
Description of Project Files:
chroma_db/
: Directory for ChromaDB-related files.images/
: Directory for images used in the Streamlit UI.scripts/
: Contains Python scripts for creating and querying embeddings.- __init__.py : initialize folder as python module
- create_embeddings.py : create embeddings of product description and store into chroma vector DB
- db.py : Script to manage MySQL database connections.
- query_embeddings.py : query the embeddings from vector DB
config.ini
: Configuration file for database and ChromaDB settings.main.py
: The main application file that runs the Streamlit UI.products.sql
: SQL script for setting up the initial database schema.readme.md
: Documentation file for the project.requirements.txt
: Lists the Python packages required for the project.
What is a Vector Database?
A vector database is a type of database designed to store and search data based on numerical vectors (numbers arranged in a specific order). In this case, each piece of text, like a product description, is turned into a vector, which is a series of numbers that represent the meaning or context of the text.
- How it works: Instead of just storing text like traditional databases (e.g., MySQL), vector databases store the meaning of the text as vectors.
- Why it’s useful: By storing the meaning as vectors, these databases can perform semantic searches, which means they find similar or related items, even if they don’t contain the exact same words. For example, if you search for “comfortable chair,” it can return results that talk about “ergonomic seating” because they mean similar things.
In our project, ChromaDB is a vector database that stores embeddings (meaningful vectors) of product descriptions. When a user searches by description, the database returns results with similar meanings, not just exact keyword matches.
What are Sentence Transformers?
Sentence Transformers are tools that take a sentence or a piece of text and convert it into a vector (a set of numbers). This vector represents the meaning of the text in a way that machines can understand and compare.
- What it does: It takes input text (like product descriptions) and turns it into a vector that captures the meaning of the sentence. This makes it easier for the vector database to compare descriptions and find similar ones.
- How it works: Sentence Transformers are based on deep learning models (specifically transformers) that learn patterns from large amounts of text data. They can understand the meaning of a sentence by looking at the relationships between the words.
- Why it’s useful: By converting sentences into vectors, the transformer allows for semantic search, where you can search for meaning instead of just keywords. This is why searching for “affordable laptop” might return results for “budget computer.”
In our project, Sentence Transformers (all-MiniLM-L6-v2) are used to create embeddings (vectors) for product descriptions, which are stored in ChromaDB. When a user searches, the transformer converts the query into a vector, and the vector database finds similar products based on this vector.
Setting Up the Environment
1. Install Required Packages
To get started, ensure you have the necessary Python packages installed. You can install the required packages using pip:
pip install chromadb pymysql sentence-transformers streamlit pandas
Create a MySQL Database
Before running the application, you need to install and set up a MySQL database. Use the products.sql
file to create the required tables. Here’s an example SQL script you might use:
# create database
create database salesdb1;
# Select Database
use salesdb1;
# create table products
CREATE TABLE products (
id INT PRIMARY KEY AUTO_INCREMENT,
name VARCHAR(255),
price DECIMAL(10, 2),
category VARCHAR(100),
description TEXT
);
Add Sample Data
Insert sample data into the products
table to test the search functionality. You can modify products.sql
or use SQL commands to insert data manually.
# insert few sample records into table (20)
INSERT INTO products (name, price, category, description) VALUES
('Wireless Mouse', 25.99, 'Electronics', 'A sleek wireless mouse with ergonomic design.'),
('Laptop Stand', 49.99, 'Accessories', 'Adjustable aluminum stand for laptops.'),
('Bluetooth Speaker', 89.99, 'Electronics', 'Portable Bluetooth speaker with high-quality sound.'),
('Smartphone', 699.99, 'Electronics', 'Latest 5G-enabled smartphone with 128GB storage.'),
('Office Chair', 199.99, 'Furniture', 'Ergonomic office chair with lumbar support.'),
('Noise Cancelling Headphones', 299.99, 'Electronics', 'Over-ear noise cancelling headphones with Bluetooth.'),
('Mechanical Keyboard', 79.99, 'Accessories', 'Mechanical keyboard with RGB backlighting.'),
('Gaming Mouse Pad', 19.99, 'Accessories', 'Large gaming mouse pad with anti-slip base.'),
('4K Monitor', 399.99, 'Electronics', '27-inch 4K UHD monitor with ultra-thin bezel.'),
('External Hard Drive', 59.99, 'Storage', '1TB external hard drive with USB 3.0.'),
('Wireless Earbuds', 129.99, 'Electronics', 'True wireless earbuds with noise cancellation.'),
('Fitness Tracker', 49.99, 'Wearables', 'Waterproof fitness tracker with heart rate monitoring.'),
('Portable Charger', 39.99, 'Accessories', '10000mAh portable charger with fast charging.'),
('Smartwatch', 199.99, 'Wearables', 'Smartwatch with GPS and heart rate sensor.'),
('Tablet', 299.99, 'Electronics', '10.5-inch tablet with 64GB storage and stylus support.'),
('Electric Kettle', 34.99, 'Appliances', 'Electric kettle with automatic shut-off.'),
('Air Purifier', 149.99, 'Appliances', 'HEPA air purifier with three fan speeds.'),
('Standing Desk', 499.99, 'Furniture', 'Adjustable standing desk with electric height adjustment.'),
('LED Desk Lamp', 29.99, 'Furniture', 'LED desk lamp with adjustable brightness and angle.'),
('Cordless Vacuum', 199.99, 'Appliances', 'Lightweight cordless vacuum cleaner with powerful suction.');
Configuring the Project
The configuration settings are stored in the config.ini
file. Here’s a sample configuration:
[mysql]
host = localhost
port = 3306
user = your_username
password = your_password
database_name = salesdb1
[chroma]
num_of_results = 3
collection_name = product_descriptions
chroma_directory = chroma_db
embedding_model = all-MiniLM-L6-v2
Make sure to replace the MySQL credentials with your own.
Creating the Database Connection
The db.py
file contains the code to establish a connection to the MySQL database. Here’s the relevant code:
import pymysql
import configparser
config = configparser.ConfigParser()
config.read("config.ini")
def get_mysql_conn():
conn = pymysql.connect(
host=config['mysql']['host'],
port=int(config['mysql']['port']),
user=config['mysql']['user'],
password=config['mysql']['password'],
database=config['mysql']['database_name'])
return conn
This function will be used to get a connection whenever database access is needed.
Generating Embeddings for Semantic Search
The create_embeddings.py
script is responsible for generating embeddings for product descriptions and adding them to ChromaDB. The script uses Sentence Transformers to create embeddings based on the descriptions fetched from the MySQL database.
import chromadb
import configparser
from db import get_mysql_conn
from sentence_transformers import SentenceTransformer
config = configparser.ConfigParser()
config.read("config.ini")
chroma_directory = config['chroma']['chroma_directory']
collection_name = config['chroma']['collection_name']
embedding_model = config['chroma']['embedding_model']
model = SentenceTransformer(embedding_model)
# Initialize Chroma DB
chroma_client = chromadb.PersistentClient(path = chroma_directory)
collection = chroma_client.create_collection(collection_name)
# Get MySQL connection
conn = get_mysql_conn()
with conn.cursor() as cursor:
cursor.execute("SELECT id, name, price, category, description FROM products")
products = cursor.fetchall()
# Add embeddings to Chroma
for product in products:
product_id, name, price, category, description = product
embedding = model.encode(description)
collection.add(
ids=[str(product_id)],
embeddings=[embedding],
metadatas=[{"id":product_id,"name": name,
"price": str(price), "category": category,
"description":description}]
)
print(f"Product{product_id} embeddings created")
print("Embeddings added to Chroma DB.")
conn.close()
Explanation:
- The script initializes ChromaDB and fetches product data from MySQL.
- For each product, it generates an embedding for the description and adds it to ChromaDB along with its metadata.
Building the Streamlit Application
The main functionality of the hybrid search engine is implemented in main.py
, which serves as the interface for users to input their queries and view results.
Key Components of main.py
:
- Streamlit Configuration: Set up the Streamlit page and sidebar options.
- User Input Handling: Allow users to select the type of query (Product Name, Description, or Price Range).
- Database and ChromaDB Queries: Execute queries based on user input and display results.
import chromadb
import configparser
import pandas as pd
import streamlit as st
from scripts.db import get_mysql_conn
from sentence_transformers import SentenceTransformer
config = configparser.ConfigParser()
config.read("config.ini")
chroma_directory = config['chroma']['chroma_directory']
collection_name = config['chroma']['collection_name']
num_of_results = int(config['chroma']['num_of_results'])
embedding_model = config['chroma']['embedding_model']
chroma_client = chromadb.PersistentClient(path = chroma_directory)
collection = chroma_client.get_collection(collection_name)
embedding_model = SentenceTransformer(embedding_model)
conn = get_mysql_conn()
st.set_page_config(
page_title="Hybrid Product Search",
page_icon="🛒",
layout="centered",
initial_sidebar_state="expanded"
)
st.sidebar.header("Semantic Search")
st.sidebar.image("images/product_search.png", use_column_width=True,
caption="Product Semantic Search")
st.header("Hybrid Product Semantic Search")
query_type = st.selectbox("Search by:", ["Product Name", "Description",
"Price Range"])
if query_type == "Product Name":
product_name = st.text_input("Enter product name:")
if st.button("Search"):
with conn.cursor() as cursor:
cursor.execute("SELECT * FROM products WHERE name LIKE %s",
('%' + product_name + '%',))
results = cursor.fetchall()
if results:
columns = ['ID', 'Name', 'Price', 'Category', 'Description']
df_results = pd.DataFrame(results, columns=columns)
st.subheader("Results from MySQL:")
st.table(df_results)
else:
st.write("No products found.")
elif query_type == "Description":
description_query = st.text_input("Enter product description:")
if st.button("Search"):
query_vector = embedding_model.encode(description_query)
results = collection.query(query_vector, n_results = num_of_results)
st.write("Results from Chroma DB:")
for result in results['metadatas']:
st.table(result)
elif query_type == "Price Range":
min_price = st.number_input("Minimum Price:", min_value=0.0)
max_price = st.number_input("Maximum Price:", min_value=0.0)
if st.button("Search"):
with conn.cursor() as cursor:
cursor.execute("SELECT * FROM products WHERE price BETWEEN %s AND %s",
(min_price, max_price))
results = cursor.fetchall()
if results:
columns = ['ID', 'Name', 'Price', 'Category', 'Description']
df_results = pd.DataFrame(results, columns=columns)
st.subheader("Results from MySQL:")
st.table(df_results)
else:
st.write("No products found.")
Explanation:
- Users can select a search type from the sidebar: product name, description, or price range.
- Depending on the selection, the application performs the relevant query in MySQL or ChromaDB and displays the results.
Understanding the Search Functionality
The search functionality allows users to retrieve relevant products based on their input. Here’s how each search type works:
- Product Name: This uses a SQL query to match product names that contain the user input, returning relevant records from MySQL.
- Description: This employs ChromaDB to perform semantic searches using embeddings, returning products with descriptions similar to the user’s query.
- Price Range: This allows users to filter products within a specified price range using a SQL query.
Running the Application
First we need to create embeddings and store into vector DB.
python scripts\create_embeddings.py
To run the application, navigate to the project directory and use the following command:
streamlit run main.py
This command starts the Streamlit server and opens the application in your default web browser.
Where This Hybrid Search is Used
- E-commerce Websites:
- Helps users find products based on both names and detailed descriptions, even if their search terms don’t exactly match product titles.
- Example: Users can search for “budget-friendly smartphones” and still find phones described as “affordable.”
- Knowledge Databases:
- Useful for companies or organizations that have large amounts of information (like articles, product guides, or manuals) to help users quickly find the most relevant document.
- Example: A user searching for “how to fix a car engine” might find an article titled “engine repair guide.”
- Product or Content Search Systems:
- Ideal for platforms with large catalogs of items or content, where users need both quick exact keyword matches and in-depth searches to find the most suitable result.
- Example: In online libraries or media platforms where descriptions are long and varied.
By solving these key problems, the hybrid search system improves search efficiency, accuracy, and user satisfaction in various practical applications.
Conclusion
This hybrid search engine project demonstrates how to effectively combine keyword-based and semantic search techniques to enhance user experience. By leveraging MySQL for structured data and ChromaDB for semantic understanding, the application provides users with relevant and context-aware search results.