Reverse Engineering YC's Job Directory, aka Work at a Startup
This tutorial demonstrates advanced web scraping techniques including CSRF token management and session handling. Use responsibly and respect rate limits.
If you have ever explored startup job opportunities, you might have come across Work at a Startup, a platform that aggregates job postings from startups. While working on a side project, I wanted to automate the process of extracting structured data from this platform to analyze job trends and hiring patterns.
This project required understanding both front-end data structures and back-end API patterns to build a robust scraping solution.
In this blog, I document my journey of reverse engineering the platform to extract job data using Python and JSON.
Step 1: Exploring the Network Requests
The first step was to identify how the data is fetched and displayed on the platform. By inspecting the network requests in Chrome DevTools, I discovered that the job data is loaded dynamically via JSON files. This was a great starting point for building my script.
- Open Chrome DevTools and navigate to the Network tab
- Filter by 'Fetch/XHR' to see API calls
- Look for JSON responses that contain job data
- Identify patterns in request headers and parameters
Pro Tip: The "Fetch/XHR" filter hides static assets like images and stylesheets, which makes the API calls that return job data much easier to spot.
Step 2: Understanding the Algolia Query
While analyzing the network requests, I discovered that the platform uses Algolia, a popular search infrastructure platform, to manage its job search index. The JSON payload sent to Algolia specifies attributes such as roles, locations, and job types. Here is an example of the query:
{
  "requests": [
    {
      "indexName": "WaaSPublicCompanyJob_created_at_desc_production",
      "params": "query=&page=0&filters=&attributesToRetrieve=%5B%22company_id%22%5D&attributesToHighlight=%5B%5D&attributesToSnippet=%5B%5D&hitsPerPage=10&clickAnalytics=true&distinct=true"
    }
  ]
}
This structured query allows for efficient filtering and retrieval of job data.
Step 3: Writing Our Own Queries
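To interact with the Algolia API, the query captured above can be replayed directly from Python. Below is a minimal sketch; the application ID and search-only API key are assumptions, copied from the X-Algolia-Application-Id and X-Algolia-API-Key headers of the request visible in DevTools (the values shown are placeholders, not real credentials):
import json
import requests

# Placeholders: copy these from the X-Algolia-Application-Id and
# X-Algolia-API-Key headers of the request shown in DevTools.
ALGOLIA_APP_ID = "YOUR_APP_ID"
ALGOLIA_API_KEY = "YOUR_SEARCH_ONLY_KEY"

payload = {
    "requests": [
        {
            "indexName": "WaaSPublicCompanyJob_created_at_desc_production",
            "params": "query=&page=0&hitsPerPage=10&distinct=true",
        }
    ]
}

# Algolia's standard multi-query REST endpoint
url = f"https://{ALGOLIA_APP_ID}-dsn.algolia.net/1/indexes/*/queries"
resp = requests.post(
    url,
    headers={
        "X-Algolia-Application-Id": ALGOLIA_APP_ID,
        "X-Algolia-API-Key": ALGOLIA_API_KEY,
    },
    data=json.dumps(payload),
)
resp.raise_for_status()

# Each entry in "results" corresponds to one request in the payload
hits = resp.json()["results"][0]["hits"]
print(f"Fetched {len(hits)} jobs from the first page")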
I saved the query responses to a JSON file and wrote a script to parse and print the relevant job details:
import json

# Load the saved JSON file
with open('waas_jobs_2025-07-09.json', 'r') as file:
    jobs = json.load(file)

# Extract and print job details
for job in jobs:
    print(f"Company: {job['company']}")
    print(f"Title: {job['title']}")
    print(f"Location: {job['location']}")
    print(f"Salary: {job['salary']}")
    print(f"Hiring Manager: {job['hiring_manager']}")
    print(f"Apply URL: {job['apply_url']}")
    print("---")
This script reads the JSON file, iterates through the job postings, and prints the relevant details.
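Since the end goal was analyzing job trends, it also helps to push the same records into a table. Here is a small sketch using pandas (part of the stack listed at the end), assuming the flattened field names shown above:
import json
import pandas as pd

# Load the same saved job data
with open('waas_jobs_2025-07-09.json', 'r') as file:
    jobs = json.load(file)

# Build a DataFrame from the job records (field names assumed from the snippet above)
df = pd.DataFrame(jobs)

# Export to CSV for later analysis of trends and hiring patterns
df.to_csv('waas_jobs_2025-07-09.csv', index=False)
print(df[['company', 'title', 'location', 'salary']].head())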
Step 4: CSRF Token Management and Session Handling
One of the biggest challenges when scraping Work at a Startup is handling authentication and CSRF (Cross-Site Request Forgery) protection. The platform implements robust security measures that require proper session management.
Security Note: CSRF tokens are security measures designed to protect against malicious requests. Always respect website terms of service and implement responsible scraping practices.
The Challenge
Initially, I was using hardcoded CSRF tokens and session cookies, which would expire after some time, causing the scraper to fail. This approach was fragile and required manual updates whenever tokens expired.
# Old approach - fragile and manual
COMPANY_HEADERS = {
    "x-csrf-token": "hardcoded-token-that-expires",
    # ... other headers
}
The Solution: Dynamic CSRF Management
To solve this, I implemented a WaaSClient class that automatically handles CSRF token extraction and session management, inspired by modern web scraping patterns:
import requests
from bs4 import BeautifulSoup

class WaaSClient:
    def __init__(self):
        self.csrf_token = ""
        self.session_cookies = {}
        self.user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"

    def init(self):
        """Extract CSRF token from homepage"""
        resp = requests.get("https://www.workatastartup.com",
                            headers={"User-Agent": self.user_agent})
        # Parse HTML to extract the CSRF token from the <meta name="csrf-token"> tag
        soup = BeautifulSoup(resp.text, 'html.parser')
        csrf_meta = soup.find('meta', attrs={'name': 'csrf-token'})
        self.csrf_token = csrf_meta['content']
        # Save session cookies for subsequent requests
        for cookie_name, cookie_value in resp.cookies.items():
            self.session_cookies[cookie_name] = cookie_value

    def refresh_csrf(self):
        """Refresh expired CSRF token"""
        resp = requests.post("https://www.workatastartup.com/verify-session",
                             headers={"X-CSRF-Token": self.csrf_token,
                                      "User-Agent": self.user_agent},
                             cookies=self.session_cookies)
        if resp.status_code == 422:
            # A 422 response carries a fresh token in its body
            self.csrf_token = resp.text.strip()

    def fetch(self, path, method="GET", **kwargs):
        """Make request with automatic CSRF handling"""
        headers = {"X-CSRF-Token": self.csrf_token, "User-Agent": self.user_agent}
        # First attempt
        resp = requests.request(method, f"https://www.workatastartup.com{path}",
                                headers=headers, cookies=self.session_cookies, **kwargs)
        # If the token has expired (422), refresh it and retry once
        if resp.status_code == 422:
            self.refresh_csrf()
            headers["X-CSRF-Token"] = self.csrf_token
            resp = requests.request(method, f"https://www.workatastartup.com{path}",
                                    headers=headers, cookies=self.session_cookies, **kwargs)
        return resp
Key Benefits
- Automatic Token Extraction: Dynamically fetches CSRF tokens from the homepage
- Session Persistence: Maintains session cookies across requests
- Auto-Recovery: Automatically refreshes expired tokens and retries failed requests
- Fallback Support: Gracefully falls back to manual methods if initialization fails
This approach makes the scraper much more robust and eliminates the need for manual token management.
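Using the client then comes down to initializing it once and routing every call through fetch. A quick usage sketch (the path below is a placeholder for illustration, not a confirmed endpoint):
client = WaaSClient()
client.init()  # pulls the CSRF token and session cookies from the homepage

# Any subsequent request reuses the session and retries automatically on a 422
resp = client.fetch("/companies")  # placeholder path
print(resp.status_code)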
Step 5: Getting LinkedIn Profiles
Many job postings include information about the founders and hiring managers, along with their LinkedIn profiles. By extracting this data, I was able to create a list of profiles for networking and further analysis. Here is an example of the extracted data:
{
  "name": "Dorothea Koh",
  "linkedin": "https://www.linkedin.com/in/dotkoh/",
  "bio": "Moon walker. Passionate about impacting healthcare in large emerging markets. ",
  "past_schools": "Bioengineering at Stanford University; Biomedical Engineering, Economics at Northwestern University; BS, Biomedical Engineering, Economics at Northwestern University"
}
This information can be used to understand the background of key individuals at startups and build meaningful connections.
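The extraction itself depends on the page markup, so the following is only a rough sketch of how it could work: fetch a company page through the WaaSClient above and collect any anchors that point at a LinkedIn profile (the company path is a placeholder; real slugs would come from the job data collected earlier):
from bs4 import BeautifulSoup

# Placeholder company path for illustration
resp = client.fetch("/companies/example-company")
soup = BeautifulSoup(resp.text, 'html.parser')

# Collect any links that point at a LinkedIn profile
profiles = []
for a in soup.find_all('a', href=True):
    if "linkedin.com/in/" in a['href']:
        profiles.append(a['href'])

print(profiles)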
Closing Reflections
This project was a fascinating dive into reverse engineering and data analysis. By automating the extraction of job data with robust CSRF token management, I was able to build a reliable scraper that can handle long-running data collection tasks.
Key Learnings
- Security Matters: Modern web platforms implement sophisticated protection mechanisms
- Dynamic Extraction: Hardcoded tokens are fragile; dynamic extraction is more reliable
- Error Recovery: Automatic retry mechanisms make scrapers more robust
- Session Management: Proper cookie handling is crucial for sustained scraping
Next Steps
Future improvements could include:
- Building a web interface to visualize the job data and trends
- Implementing automated daily updates to track new job postings
- Adding rate limiting and respectful scraping practices
- Expanding the analysis to include salary trends and hiring patterns
- Creating alerts for specific job criteria or companies
Technical Stack
The final implementation uses:
- Python for core logic and data processing
- curl-cffi for browser-like HTTP requests with TLS fingerprint impersonation (a minimal sketch follows this list)
- BeautifulSoup for HTML parsing and CSRF token extraction
- pandas for data manipulation and CSV export
- JSON for structured data storage
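For the curl-cffi piece, the browser-like behaviour comes from its impersonate option, which mimics a real browser's TLS fingerprint. A minimal sketch (depending on the installed curl_cffi release, the value may need to be a specific version string such as "chrome110"):
from curl_cffi import requests as curl_requests

# Fetch the homepage with a Chrome-like TLS fingerprint
resp = curl_requests.get("https://www.workatastartup.com", impersonate="chrome")
print(resp.status_code)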
Happy scraping! 🚀
If you found this exploration helpful or have suggestions for improvements, feel free to reach out! The complete code demonstrates how to build production-ready web scrapers with proper session management.