SF Event Recommender using LLMs

Gaurav Mohan
8 min read · Feb 25, 2023

Large Language Models can be used at every step of building a recommendation system. In this project I used OpenAI’s models to synthesize, format, and store the data, then applied embeddings to it to tailor recommendations for myself.

San Francisco has a ton of events, but residents have to look hard to find them on a weekly basis. Most events are listed on venue websites or free event-posting sites; however, on most sites you have to pay to get personalized recommendations. As an SF resident, I wanted to build my own event recommender without relying on paid services or sifting through numerous event websites and Instagram pages.

I scraped event information from SF Fun Cheap, Eventbrite, and 19hz (an EDM- and techno-focused SF music event page). I have added a quick demo of the frontend interface I made for myself.

Let’s walk through what happens in the backend of this simple interface to see how events are matched to my preferences and dynamically updated.

Data Processing with LLMs

The first challenge of building this app is parsing the scraped data into a standardized format. An LLM is super helpful here, not only for formatting the data but also for creating more features for the model. For example, in each of the event scrapers I used GPT-3 to parse the data into JSON.

def generate_prompt(event):
    # Ask GPT-3 to extract the event details into a fixed JSON structure.
    prompt = ("""Extract information from the event below and store it into a
JSON with the following structure
{venue: venue_name, date: date_and_time, age: age_requirement, price: price, location: address}"""
              + ' and enclose each key and value in the JSON structure with double quotes.'
              + '\n' + "Event: " + event)
    return prompt


import json

import openai
import pandas as pd
import requests
from bs4 import BeautifulSoup

sf_events = pd.DataFrame()

for link, headline in zip(hrefs, headlines):
    headline = headline.strip()
    if link is not None and len(headline) > 0:
        # Pull the raw event page and grab its title and summary block.
        response = requests.get(link)
        html = response.content
        soup = BeautifulSoup(html, 'html.parser')
        p_tags = soup.find_all('p')
        event_name = soup.find('h1').get_text()
        text = ''
        div = soup.find('div', {'id': 'stats'})
        text += div.get_text()

        # Have GPT-3 turn the unstructured event text into a JSON record.
        response = openai.Completion.create(
            model="text-davinci-003",
            prompt=generate_prompt(text),
            temperature=0.0,
            max_tokens=200,
        )

        response_text = response['choices'][0]['text'].strip()
        response_dict = json.loads(response_text)

        sf_events = sf_events.append(response_dict, ignore_index=True)
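For a typical event page, the completion comes back as a small JSON object along these lines (the values here are illustrative, not real output):

response_dict = {
    "venue": "The Independent",
    "date": "Friday, March 3rd, 8:00 PM",
    "age": "21+",
    "price": "$25",
    "location": "628 Divisadero St, San Francisco, CA"
}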

I then passed the parsed venue name to GPT-3 to gather venue information and store it as a feature. I also used OpenCage to geocode the parsed event locations and store each one as a Shapely Point. Again, GPT comes in handy for finding venue addresses in case the scraped event data doesn’t include one.

def find_venue_type(venue):
    # Ask GPT for a short description of the venue to use as a feature.
    prompt = ('Provide a short description of the following venue in SF. '
              "Utilize the venue's website or online search to find the answer."
              + '\n' + 'Venue: ' + venue)
    return prompt

venue_description = {}
for index, row in sf_events.iterrows():
    if row['venue'] not in venue_description:
        response = openai.Completion.create(
            model="text-curie-001",
            prompt=find_venue_type(row['venue']),
            temperature=0.3,
            max_tokens=100,
        )
        response_text = response['choices'][0]['text'].strip()
        # Cache the description so each venue is only queried once.
        venue_description[row['venue']] = response_text
        sf_events.loc[index, 'venue_description'] = response_text
    else:
        sf_events.loc[index, 'venue_description'] = venue_description[row['venue']]
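For reference, the geocoding step looks roughly like the following. This is a minimal sketch using the opencage package; it assumes an API key in an OPENCAGE_API_KEY environment variable, and the column names are illustrative.

import os

from opencage.geocoder import OpenCageGeocode
from shapely.geometry import Point

geocoder = OpenCageGeocode(os.getenv("OPENCAGE_API_KEY"))

def geocode_address(address):
    # OpenCage returns a list of candidate matches; take the top one.
    results = geocoder.geocode(address + ", San Francisco, CA")
    if not results:
        return None
    geometry = results[0]['geometry']
    # Shapely Points are (x, y) = (longitude, latitude).
    return Point(geometry['lng'], geometry['lat'])

sf_events['geolocation'] = sf_events['location'].apply(geocode_address)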

Event Embeddings

Since I only have to deal with one user, the recommendation system boils down to a similarity search. To perform a similarity search, the event and user data need to be converted into a numeric format, i.e., embeddings. For each of the data sources I perform the following embedding process.

import os

import openai
import pandas as pd
import tiktoken
from openai.embeddings_utils import get_embedding

openai.api_key = os.getenv("OPENAI_API_KEY")

# Embedding model parameters
embedding_model = "text-embedding-ada-002"
embedding_encoding = "cl100k_base"  # this is the encoding for text-embedding-ada-002
max_tokens = 5000  # the maximum for text-embedding-ada-002 is 8191

# Cast every column to string so the fields can be concatenated.
fc_df_embeddings = sf_events.applymap(str)
encoding = tiktoken.get_encoding(embedding_encoding)
sf_events["combined"] = (
    "Venue: " + fc_df_embeddings.venue.str.strip() + "; Venue Description: " + fc_df_embeddings.venue_description.str.strip() +
    "; Event: " + fc_df_embeddings.event_name.str.strip() + "; Event Description: " + fc_df_embeddings.event_description.str.strip() +
    "; Price: " + fc_df_embeddings.price.str.strip() + "; Neighborhood: " + fc_df_embeddings.neighborhood.str.strip()
)

sf_events["n_tokens"] = sf_events.combined.apply(lambda x: len(encoding.encode(x)))
sf_events["embedding"] = sf_events.combined.apply(lambda x: get_embedding(x, engine=embedding_model))

The event embeddings are then stored as a NumPy array. The same process is applied to the user embedding, and a similarity matrix is created from the user and event embeddings.

import numpy as np

# eval is only needed if the embeddings were round-tripped through CSV,
# where each one is read back as a string rather than a list of floats.
event_embeddings = sf_events.embedding.apply(eval).apply(np.array)
event_embeddings = np.array(list(event_embeddings))

# User Embeddings

interests = ['Live music', 'DJ performances', 'Dancing', 'Karaoke', 'Comedy shows', 'Trivia nights', 'Art exhibits', 'Film screenings', 'Wine tasting', 'Craft beer tasting', 'Mixology classes', 'Food events', 'Sports viewing parties', 'Networking events', 'Tech events']
venue_types = ['Speakeasy Bar', 'Dive Bar', 'Nightclubs', 'Music Venue', 'Comedy Club', 'Theaters', 'Art Gallery', 'Film theaters', 'Wineries', 'Breweries', 'Cocktail Bars', 'Sports Bars', 'Restaurants with Live Music', 'Rooftop bars', 'Convention Centers']
price_range_pref = ['free', '$10-$20', '$20-$50', '$50-$100', 'over $100']
time_of_day_pref = ['daytime', 'evening', 'night']
neighborhoods_of_interest = ['Financial District', 'Nob Hill', 'Pacific Heights', 'Russian Hill', 'North Beach', 'The Mission', 'Castro', 'Haight-Ashbury', 'SoMa', 'Marina']

user_interests = ['Live music', 'DJ performances', 'Comedy shows', 'Food events', 'Networking events', 'Tech events']
user_venues = ['Speakeasy Bar', 'Dive Bar','Music Venue', 'Comedy Club', 'Cocktail Bars', 'Theaters']
user_prices = ['free', '$10-$20', '$20-$50']
user_time = ['evening', 'night']
user_neighborhoods = ['Nob Hill', 'Pacific Heights', 'Russian Hill', 'North Beach', 'The Mission', 'Castro', 'Haight-Ashbury']
encoding = tiktoken.get_encoding(embedding_encoding)
combined = ("Event Preferences: " + str(user_interests) + "; Venue Preferences: " + str(user_venues) +
            "; Price Range Preferences: " + str(user_prices) + "; Time of Day Preferences: " + str(user_time) +
            "; Neighborhoods of Interest: " + str(user_neighborhoods))

user_embedding = get_embedding(combined, engine=embedding_model)
user_embedding = np.array(user_embedding)

Similarity Search

Once the user preferences are combined and embedded alongside each event from the scraped event pages, I can apply a cosine similarity formula to the embeddings. Cosine similarity measures the similarity between two vectors using the following formula:

cosine_similarity(a, b) = (a · b) / (||a|| ||b||)
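In NumPy this is a one-liner, and it is the same quantity that scikit-learn’s cosine_similarity computes pairwise:

import numpy as np

def cosine_sim(a, b):
    # (a · b) / (||a|| ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))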

The similarity scores fall in [-1, 1], where 1 means the two vectors point in the same direction. I calculated the similarity matrix with the following function and then stored the events data, along with the similarity score and the distance to each event, in a MySQL database.

Every 2 weeks I run an automation script to scrape the data and recreate the events table in my database.

import pymysql
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(user_embedding.reshape(1, -1), event_embeddings)

conn = pymysql.connect(host='localhost', user='root', password=mysql_pass, db='mysfevents')
cursor = conn.cursor()

create_table_query = """CREATE TABLE events (
    id INT AUTO_INCREMENT PRIMARY KEY,
    venue VARCHAR(150),
    event VARCHAR(150),
    date VARCHAR(150),
    age VARCHAR(150),
    price VARCHAR(150),
    location VARCHAR(150),
    venue_desc VARCHAR(3000),
    event_desc VARCHAR(5000),
    neighborhood VARCHAR(200),
    distance FLOAT,
    similarity FLOAT
);"""

cursor.execute(create_table_query)
conn.commit()

sql_query = "INSERT INTO events (venue, event, date, age, price, location, venue_desc, event_desc, neighborhood, distance, similarity) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"

data = database_df.to_records(index=False).tolist()

# Insert every row and commit before closing the connection.
cursor.executemany(sql_query, data)
conn.commit()

cursor.close()
conn.close()
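As for the biweekly refresh itself, here is a minimal sketch of scheduling it in Python with the schedule package, assuming a hypothetical refresh_events() wrapper around the scrape, embed, and load steps above (a cron job works just as well):

import time

import schedule

def refresh_events():
    # Hypothetical wrapper around the scrape -> parse -> embed -> load steps.
    ...

schedule.every(2).weeks.do(refresh_events)

while True:
    schedule.run_pending()
    time.sleep(3600)  # check once an hour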

The following function shows how I calculate the distance between the geo-coordinates of each event and my location.

from math import radians, sin, cos, sqrt, atan2

from shapely.geometry import Point
from shapely.wkt import loads

# Shapely Points are (x, y) = (longitude, latitude).
my_loc = Point(-122.4329059, 37.7652844)

def distance_in_miles(loc1, loc2):
    """Calculate the distance in miles between two geolocation points using the Haversine formula."""
    # Event locations are stored as WKT strings, so parse loc2 first.
    loc2 = loads(loc2)

    # Convert coordinates to radians
    lon1, lat1 = radians(loc1.x), radians(loc1.y)
    lon2, lat2 = radians(loc2.x), radians(loc2.y)

    # Haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    distance_miles = 3963.0 * c  # Radius of Earth in miles

    return float(round(distance_miles, 3))

df['distance_miles'] = df.apply(lambda row: distance_in_miles(my_loc, row['geolocation']), axis=1)

Dynamic Recommendations

After loading the data into a database, I created a frontend and backend using Node.js and JavaScript to view the events ranked by their similarity scores. I wanted my recommendations to learn and adjust dynamically as I interacted with the frontend, similar to most recommendation systems. Companies like Netflix and TikTok use this concept of invisible AI, which takes user interactions with the visible UI into account to tailor recommendations. TikTok uses viewing and search history to recommend the next video, while Netflix uses viewing history, users’ liked lists, and collaborative filtering to recommend movies and shows in real time.

In my application I created a like button: each time I like an event on my main page, it adds the event ID to a liked-events table and runs a separate script in parallel that incorporates the liked events into the embeddings and recalculates the similarity scores. The goal is to tune the recommendations to my preferences and the events I have liked so far. Here is how I accomplish these changes dynamically.

<!-- Frontend component: the events table with a like button per row -->
<tbody id="events-table">
  <% events.forEach(event => { %>
    <tr>
      <td><%= event.id %></td>
      <td><%= event.venue %></td>
      <td><%= event.event %></td>
      <td><%= event.event_desc %></td>
      <td><%= event.date %></td>
      <td><%= event.price %></td>
      <td><%= event.location %></td>
      <td><%= event.distance %></td>
      <td>
        <form action="/like" method="post" class="like-form">
          <input type="hidden" name="eventId" value="<%= event.id %>">
          <button type="submit" class="like-button">Like</button>
        </form>
      </td>
    </tr>
  <% }) %>
</tbody>
// Backend component

const { spawn } = require('child_process');

app.post('/like', (req, res) => {
  console.log(req.body);
  const eventId = req.body.eventId;
  const query = 'INSERT INTO liked_events (event_id) VALUES (?)';
  connection.query(query, [eventId], (err, result) => {
    if (err) throw err;
    // Call the Python script to reassess similarity scores.
    // (Executing a .ipynb directly requires a tool like papermill or
    // jupyter nbconvert; exporting it to a plain .py script also works.)
    const pyScript = spawn('python', ['./update_recs.ipynb']);

    pyScript.stdout.on('data', (data) => {
      console.log(`stdout: ${data}`);
    });

    pyScript.stderr.on('data', (data) => {
      console.error(`stderr: ${data}`);
    });

    pyScript.on('close', (code) => {
      console.log(`child process exited with code ${code}`);
      res.redirect('/');
    });
  });
});
# Invisible AI component (the update_recs script)

# Zero out the embeddings of already-liked events so they no longer
# dominate the ranking.
liked_events_mask = np.isin(event_ids, liked_event_ids)
masked_event_embeddings = np.copy(event_embeddings)
masked_event_embeddings[liked_events_mask] = 0.0
user_embedding_reshaped = np.reshape(user_embedding, (1, -1))

all_embeddings = np.concatenate([user_embedding_reshaped, masked_event_embeddings])

# Row 0 is the user, so take its similarity to every event row.
similarity_scores = cosine_similarity(all_embeddings)[0][1:]

for i, event_id in enumerate(event_ids):
    similarity = similarity_scores[i]
    sql = "UPDATE events SET similarity = %s WHERE id = %s"
    val = (similarity, event_id)
    cursor.execute(sql, val)
conn.commit()
  1. The code defines an array liked_events_mask that contains boolean values indicating whether each event in event_ids has been previously liked by the user (i.e., whether the event ID is in the liked_event_ids array).
  2. The code then creates a copy of the event_embeddings array and assigns it to the masked_event_embeddings variable.
  3. The code sets the embeddings of all events that have been liked by the user (as indicated by the liked_events_mask array) to zero in the masked_event_embeddings array. This effectively "masks" the embeddings of these events so that they don't contribute to the similarity calculation.
  4. The code then reshapes the user_embedding array into a 1-row matrix and assigns it to the user_embedding_reshaped variable.
  5. Finally, the code concatenates the reshaped user_embedding_reshaped matrix and the masked_event_embeddings matrix, creating a new matrix all_embeddings that holds the user's preference embedding in row 0 followed by every event embedding, with the liked events' rows zeroed out.

By masking the liked events out of the event embeddings, the model must find other events that are similar to my preferences. Since a zeroed-out row has zero cosine similarity with everything, liked events fall to the bottom of the ranking while unliked events that match my preferences rise to the top when the table is re-sorted.
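As a quick sanity check on that zero-vector behavior: scikit-learn returns a similarity of 0 (rather than NaN) for zero-norm rows, which is what pushes liked events to the bottom:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

user = np.array([[0.3, 0.5, 0.2]])
liked_event = np.array([[0.0, 0.0, 0.0]])  # a masked (liked) event row
print(cosine_similarity(user, liked_event))  # [[0.]]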

Conclusion and Next Steps

The goal of this project was to build a useful event recommendation engine for myself to find both professionally focused and fun events locally. I also wanted to show how far transformer models have come and how they can be used at every step of an end-to-end application: I used an LLM to pre-process the scraped data, store it, create features, embed the records, and ultimately generate recommendations from those embeddings. While these models carry real costs, this project highlights how valuable they can be when incorporated thoughtfully into the overall architecture of a system.
