<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[PJ Hoberman]]></title><description><![CDATA[Thoughts and things that I felt needed writing]]></description><link>https://blog.pjhoberman.com</link><image><url>https://cdn.hashnode.com/uploads/logos/69c1f26830a9b81e3ad28c82/d1093e9b-049a-4e1d-a8be-4224c3605f24.png</url><title>PJ Hoberman</title><link>https://blog.pjhoberman.com</link></image><generator>RSS for Node</generator><lastBuildDate>Wed, 08 Apr 2026 04:26:52 GMT</lastBuildDate><atom:link href="https://blog.pjhoberman.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Build a Knowledge Base That Compounds]]></title><description><![CDATA[Andrej Karpathy recently posted about using LLMs to build personal knowledge bases. Fifteen million people saw it. His follow-up explicitly said "there is room here for an incredible new product inste]]></description><link>https://blog.pjhoberman.com/build-a-knowledge-base-that-compounds</link><guid isPermaLink="true">https://blog.pjhoberman.com/build-a-knowledge-base-that-compounds</guid><category><![CDATA[AI]]></category><category><![CDATA[Productivity]]></category><category><![CDATA[KnowledgeManagement]]></category><category><![CDATA[llm]]></category><category><![CDATA[workflow]]></category><dc:creator><![CDATA[PJ Hoberman]]></dc:creator><pubDate>Tue, 07 Apr 2026 23:28:14 GMT</pubDate><content:encoded><![CDATA[<p>Andrej Karpathy <a href="https://x.com/karpathy/status/2039805659525644595">recently posted</a> about using LLMs to build personal knowledge bases. Fifteen million people saw it. His follow-up explicitly said "there is room here for an incredible new product instead of a hacky collection of scripts."</p>
<p>I've been running exactly this system for 29 sessions over the past five weeks. Not as a product, as a workflow. And the thing I've learned is that the value has almost nothing to do with the tooling. It lives in the rules you set before the first session.</p>
<p>Without a system, I was the same as everyone else: reading 10-20 technical sources a week, retaining maybe 10% within a few days, and losing the connections between ideas entirely. Bookmarks are where links go to die. My Pocket/Matter queue just gets longer and heavier with guilt.</p>
<p>So I built what I call a Signal Ledger. I drop sources into a conversation with Claude, and the LLM filters for signal relevant to my actual work. Not summaries. Distillations. The difference matters, and I'll come back to it.</p>
<h2>Write the Rules Down Before Session 1</h2>
<p>The single most important thing I did was write the rules down before processing a single source. I call it the contract, a document that tells the LLM exactly how to behave during every session.</p>
<p>The key clauses:</p>
<ul>
<li><p><strong>3-5 bullets per source, maximum.</strong> This forces compression. If the LLM can't distill a 3,000-word article into 3-5 bullets, it didn't understand the article.</p>
</li>
<li><p><strong>Lead with actionable signal, not interesting signal.</strong> "This is a cool finding" is not the same as "this changes how you should build X." Actionable comes first.</p>
</li>
<li><p><strong>Every entry gets a "so what for you" tied to an active project or theme.</strong> This is the part that makes the whole thing work. Every source has to connect to something I'm currently building or tracking. If there's no connection, the source might still be worth reading, but it's not signal for the ledger.</p>
</li>
<li><p><strong>Negative signal matters.</strong> "This article wasn't useful because X" is explicitly part of the output. It calibrates future sessions and stops me from wasting time on similar sources later.</p>
</li>
</ul>
<p>And then the distinction I mentioned: no summaries, only distillations. A summary tells you what the article said. A distillation tells you what it means for your work. A summary of a Stripe engineering post might say "Stripe built an internal coding agent that merges 1,300 PRs per week." A distillation says "Stripe's key insight is that their agent works because of infrastructure built for human engineers years before LLMs existed. The harness predates the model. Apply this to your own CI/CD before building agent tooling on top."</p>
<p>Without the contract, you get a glorified RSS reader. With it, you get an editor that applies judgment.</p>
<h2>Themes and Corroboration</h2>
<p>The core mechanism that makes this compound rather than accumulate is theme tracking with corroboration requirements.</p>
<p>Every signal gets classified into one of three tiers:</p>
<p><strong>Parking Lot.</strong> A single-source signal. One person said something interesting. Might be noise. Might be early signal. It sits here until corroborated.</p>
<p><strong>Active Theme (watch).</strong> Two sources from independent origins describing the same pattern. Worth tracking, not yet worth acting on.</p>
<p><strong>Active Theme (confirmed).</strong> Three or more independent sources converging. Signal you can write about, build on, or cite with confidence.</p>
<p>The independence requirement is the key. Two blog posts that both cite the same viral tweet aren't independent. They're amplification of one signal. Three practitioners at different companies, different tech stacks, describing the same architectural pattern without referencing each other? That's convergence.</p>
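<p>The promotion rule is mechanical enough to sketch in code. This is an illustration of the logic only; my actual ledger is a plain document, and the data shapes here are invented for the example:</p>
<pre><code class="language-python">def classify_theme(sources):
    """Tier a signal by how many independent origins support it.

    Dedupe on origin first: two posts amplifying the same viral
    tweet count as one origin, not two.
    """
    origins = {s["origin"] for s in sources}
    if len(origins) >= 3:
        return "active-confirmed"  # write about it, build on it
    if len(origins) == 2:
        return "active-watch"      # track it, don't act yet
    return "parking-lot"           # single source: might be noise

sources = [
    {"url": "blog-a/post", "origin": "practitioner-a"},
    {"url": "blog-b/post", "origin": "practitioner-b"},
    {"url": "blog-c/post", "origin": "practitioner-b"},  # amplification, not independence
]
</code></pre>
<p>Run against that example list, the theme lands in the watch tier: three sources, but only two independent origins.</p>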
<p>Concrete example: I track a theme called "Harness Engineering" that started as a single mention in Session 2. One practitioner's blog post arguing that the infrastructure wrapping an LLM matters more than the model itself. Interesting, single source, into the parking lot.</p>
<p>By Session 29, it has 59 independent sources. Practitioners at Stripe, Meta, Vercel, and Anthropic. Academic researchers. Solo developers. Open-source tool authors. A VC investor. An Anthropic safety researcher. All independently converging on the same pattern.</p>
<p>I didn't go looking for that theme. It emerged from the data. The corroboration requirement forced me to notice it organically rather than cherry-pick evidence for something I already believed. And that's the real reason for counting sources and tracking independence: it forces intellectual honesty about what you actually know versus what you want to be true.</p>
<h2>What Compounding Looks Like in Practice</h2>
<p>After 29 sessions across roughly five weeks, here's what's different.</p>
<p>I've processed over 200 sources. I track 11 active themes with named mechanisms and cross-references between them. I have a parking lot of 15+ single-source signals being watched. I have a backlog of writing candidates where each one has 10+ sources of evidence assembled before I write a word.</p>
<p>The qualitative shift matters more. I stopped chasing individual articles and started seeing patterns across articles. A new source on AI code review lands differently when I already have 25 sources on the broader pattern it fits into. Processing takes less time as sessions progress because existing themes provide immediate context. "This is the 4th independent source on comprehension debt" is more useful than "here's another article about AI coding risks."</p>
<p><a href="https://blog.pjhoberman.com/autoresearch-60-experiments-production-search">My first published blog post after enacting this process</a> drew from sources tracked across 8 sessions. I didn't have to go find supporting material after deciding to write. The ledger had already assembled it.</p>
<p>I think the difference is between reading and research. Reading without a system is consumption. Each article exists in isolation, competes with everything else you read that week, and fades. With a compounding system, today's reading makes tomorrow's reading more valuable because it either reinforces, nuances, or contradicts something you're already tracking.</p>
<p>This is what compounding looks like for me:</p>
<table>
<thead>
<tr>
<th>Theme</th>
<th>Sources</th>
<th>First Appeared</th>
</tr>
</thead>
<tbody><tr>
<td>Harness Engineering</td>
<td>59</td>
<td>Session 2</td>
</tr>
<tr>
<td>Capability/Practice Gap</td>
<td>29</td>
<td>Session 1</td>
</tr>
<tr>
<td>Vibe Coding Risk</td>
<td>26</td>
<td>Session 3</td>
</tr>
<tr>
<td>BDD / Spec-First</td>
<td>16+</td>
<td>Session 3</td>
</tr>
<tr>
<td>Comprehension Debt</td>
<td>18</td>
<td>Session 17</td>
</tr>
<tr>
<td>Context Infrastructure</td>
<td>21</td>
<td>Session 4</td>
</tr>
<tr>
<td>Execution Layer &gt; Model Layer</td>
<td>19</td>
<td>Session 7</td>
</tr>
<tr>
<td>Autonomous Compounding Loop</td>
<td>17</td>
<td>Session 12</td>
</tr>
<tr>
<td>Third Era</td>
<td>13</td>
<td>Session 3</td>
</tr>
<tr>
<td>Persistent Agent Memory</td>
<td>11</td>
<td>Session 3</td>
</tr>
<tr>
<td>Code Review as Delivery Chokepoint</td>
<td>4</td>
<td>Session 16</td>
</tr>
</tbody></table>
<h2>What Doesn't Work</h2>
<p>The ledger gets unwieldy. Past roughly 50,000 words, appending to a single file causes performance issues. I added "theme docs" as a patch: standalone reference documents per theme that get rewritten when a theme evolves significantly. If you've built agent memory systems, this problem will sound familiar. Any append-only store needs a compaction mechanism eventually.</p>
<p>Confirmation bias is a real risk. When you have a named theme, you start seeing it everywhere. A source that vaguely touches the topic gets filed under the theme even when it's tangential at best. I added a monthly health check: a structured audit that reviews the entire ledger for contradictions, unsourced claims, stale themes, and drift between what the sources actually say and what my framing claims they say. The first one flagged two themes where I was stretching the source material. I wouldn't have caught it otherwise.</p>
<p>Not every session is high-signal. Some batches of 10 sources produce one actionable insight and nine entries of "this was noise." That's calibration, not waste. But it doesn't feel great in the moment.</p>
<p>The "so what for you" framing requires active projects. If you're not building anything, it collapses back into summarization. The system is for people who are working on things, not for passive readers.</p>
<h2>Steal This</h2>
<p>The specific tools don't matter. I use Obsidian and Claude, but the principles work with any LLM and any notes app. Here's what does matter:</p>
<p><strong>1. Write a contract before Session 1.</strong> Format, length limits, what counts as signal, how to handle noise. Write it down. The LLM should follow these rules every session without you re-explaining them. <a href="https://gist.github.com/pjhoberman/7b471b39b05f9858a8aa97951abcfd0b">Here's a version of the contract I use</a> - fork it and make it your own.</p>
<p><strong>2. Require a "so what for your work" on every source.</strong> This kills the instinct to collect interesting things and replaces it with a filter for useful things. If a source can't be tied to something you're building, it might still be worth reading. It's just not signal.</p>
<p><strong>3. Track themes with corroboration requirements.</strong> Don't promote a signal to "something I believe" until three or more independent sources converge. Count the sources. Track whether they're independent. This is what separates a knowledge base from a collection of highlights.</p>
<p><strong>4. Log negative signal.</strong> "This wasn't useful because the author conflated two different patterns" teaches the system and you what to skip next time.</p>
<p><strong>5. Audit yourself.</strong> Monthly health checks. Drift detection. Review your parking lot. Stuff that's been sitting there for weeks without corroboration was probably noise.</p>
<p><strong>6. Each session should make the next session faster.</strong> If your 20th session takes the same effort as your 5th, you're accumulating, not compounding. Existing themes should provide context that accelerates processing. If they don't, something's off.</p>
<h2>Where I Am Now</h2>
<p>Karpathy said there's room for a product here. Maybe. But after 29 sessions, I think the product question is less interesting than the workflow question. A tool that summarizes your reading is an RSS reader with a language model. What changes the dynamic is enforcing corroboration requirements, tying every input to active projects, and auditing yourself for confirmation bias. Those are editorial decisions, not features.</p>
<p>The ledger now tracks connections I've forgotten and assembles evidence bases for things I haven't written yet. I don't read articles the same way I used to. I read them as potential entries in a system that's been building context for weeks.</p>
<p>That's the workflow. Tools are up to you.</p>
]]></content:encoded></item><item><title><![CDATA[I Ran 60 Autoresearch Experiments on a Production Search Algorithm. Here's What Actually Happened.]]></title><description><![CDATA[Everyone's writing about Karpathy's autoresearch. Most of it is "here's how the loop works" or "imagine the possibilities." I wanted to see what happens when you point it at a real codebase with a rea]]></description><link>https://blog.pjhoberman.com/autoresearch-60-experiments-production-search</link><guid isPermaLink="true">https://blog.pjhoberman.com/autoresearch-60-experiments-production-search</guid><category><![CDATA[autoresearch]]></category><category><![CDATA[claude.ai]]></category><category><![CDATA[claude-code]]></category><category><![CDATA[AI]]></category><category><![CDATA[Python]]></category><category><![CDATA[Open Source]]></category><dc:creator><![CDATA[PJ Hoberman]]></dc:creator><pubDate>Tue, 24 Mar 2026 02:37:38 GMT</pubDate><content:encoded><![CDATA[<hr />
<p>Everyone's writing about <a href="https://github.com/karpathy/autoresearch">Karpathy's autoresearch</a>. Most of it is "here's how the loop works" or "imagine the possibilities." I wanted to see what happens when you point it at a real codebase with a real metric, not a training script.</p>
<p>So I ran two rounds. 60 total iterations. The first round improved things. The second round found nothing - and that turned out to be even more interesting.</p>
<h2>The System</h2>
<p>I work on a hybrid search system: Cohere embeddings in pgvector for semantic similarity, then a keyword re-ranking layer on top. Django, PostgreSQL, Bedrock. The kind of search stack a lot of teams are probably running right now.</p>
<p>The ranking logic lives in one file: <code>utils.py</code>. It takes the top 100 vector search candidates, scores them on keyword and tag matches across location, activity, and general terms, normalizes everything with z-scores, applies adaptive correlation-based weighting to avoid double-counting, and combines it all into a final score: <code>similarity * (1 + keyword_boost)</code>.</p>
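<p>Schematically, that combination step looks something like this. A sketch with invented names and constants, not the actual <code>utils.py</code>:</p>
<pre><code class="language-python">import numpy as np

def rerank(similarities, keyword_scores, weight=0.5, zclip=2.0):
    """Blend vector similarity with a z-scored, clipped keyword boost."""
    z = (keyword_scores - keyword_scores.mean()) / (keyword_scores.std() + 1e-9)
    z = np.clip(z, -zclip, zclip)            # tame outlier keyword counts
    spread = z.max() - z.min() + 1e-9
    boost = weight * (z - z.min()) / spread  # boost in [0, weight]
    return similarities * (1 + boost)        # final: similarity * (1 + keyword_boost)
</code></pre>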
<p>There are a lot of knobs. Base weights for three query types. A scoring formula for body keyword matches. Z-score clipping bounds. A correlation shrinkage function. The final combination formula. All hand-tuned. All "seems about right."</p>
<p>Perfect autoresearch target.</p>
<h2>The Setup</h2>
<p>The autoresearch pattern is simple: <strong>one file, one metric, one loop</strong>. The agent edits the file, runs the eval, keeps improvements, reverts failures, repeats.</p>
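<p>Stripped of the agent, the keep/revert loop is tiny. Here's a toy version with one knob and a synthetic metric (the real driver is Claude Code editing a file, not a script):</p>
<pre><code class="language-python">import random

def evaluate(weight):
    """Synthetic eval: a single metric whose optimum sits at weight = 3.0."""
    return -(weight - 3.0) ** 2

random.seed(0)
weight, best = 1.0, evaluate(1.0)
for _ in range(44):                              # 44 iterations, like Round 1
    candidate = weight + random.uniform(-1, 1)   # the agent proposes an edit
    score = evaluate(candidate)
    if score > best:
        weight, best = candidate, score          # keep the improvement
    # otherwise: revert, i.e. simply don't adopt the candidate
</code></pre>
<p>Most candidates get rejected, a few get kept, and the knob crawls toward the optimum. The real loop has the same shape, with <code>utils.py</code> as the knob and the eval command as <code>evaluate</code>.</p>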
<p>Here's what I set up:</p>
<p><strong>The constrained file:</strong> <code>utils.py</code> — ranking logic only. The embedding service, query metadata extraction, database schema — all frozen.</p>
<p><strong>The metric:</strong> A composite score: 80% Precision@12 (how many of the top 12 results are actually relevant) and 20% MRR (is the best result near the top). I weighted it this way because MRR was already at 0.975 — almost every query already had the right #1 result. The room to improve was in the rest of the top 12.</p>
<p><strong>The test set:</strong> 20 queries across three types (location, activity, general) with hand-labeled expected results. Things like "best hiking trails near Aspen Colorado," "beginner backpacking gear list," "avalanche safety backcountry skiing." I ran each query, looked at the top 50 results, and picked the ones that actually answered the question.</p>
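<p>Given labeled queries, the composite is a few lines to compute. A sketch with my own naming, not the eval script itself:</p>
<pre><code class="language-python">def precision_at_k(ranked_ids, relevant_ids, k=12):
    """Fraction of the top k results that are labeled relevant."""
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids) / k

def reciprocal_rank(ranked_ids, relevant_ids):
    """1/rank of the first relevant result, 0 if none found."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1 / rank
    return 0.0

def composite(queries):
    """queries: list of (ranked_ids, relevant_ids) pairs, one per test query."""
    n = len(queries)
    p12 = sum(precision_at_k(r, rel) for r, rel in queries) / n
    mrr = sum(reciprocal_rank(r, rel) for r, rel in queries) / n
    return 0.8 * p12 + 0.2 * mrr
</code></pre>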
<p><strong>The eval caching trick:</strong> Each query hits Bedrock twice (query metadata + embedding). That's 15 seconds per query. But the agent only modifies the <em>ranking</em> logic — the embeddings and metadata don't change between iterations. So I cached all the API results on the first run and monkey-patched them in on subsequent runs. Eval went from 6 minutes to about 30 seconds.</p>
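<p>The caching itself is a dozen lines. A sketch of the idea; the real version monkey-patches the service methods, and the names here are invented:</p>
<pre><code class="language-python">import hashlib
import json
from functools import wraps
from pathlib import Path

CACHE_FILE = Path("eval_cache.json")

def cache_between_iterations(fn):
    """Persist API results across eval runs.

    Safe only because the agent edits ranking logic alone, so the
    inputs to these calls never change between iterations.
    """
    store = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}

    @wraps(fn)
    def wrapper(query):
        key = hashlib.sha256(f"{fn.__name__}:{query}".encode()).hexdigest()
        if key not in store:
            store[key] = fn(query)  # the one slow API call, ever
            CACHE_FILE.write_text(json.dumps(store))
        return store[key]
    return wrapper

@cache_between_iterations
def get_embedding(query):
    # stand-in for the real Bedrock call
    return [0.1, 0.2, 0.3]
</code></pre>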
<p>I wrote an <code>instructions.md</code> that told Claude Code exactly what it could touch, what it couldn't, and what strategies to try in roughly what order. Here's the skeleton:</p>
<pre><code class="language-markdown">## The Constrained File
- `src/service/utils.py` — ONLY file you may edit

## What You Cannot Modify
- eval script, test queries, embedding service, cache files, schema

## Eval
- `uv run manage.py run_autoresearch_eval`
- SCORE = 0.8 * Precision@12 + 0.2 * MRR

## Strategy Guidance (roughly in this order)
1. Quick wins: base weights, pool size, zclip range
2. Scoring function: damping, formula shape
3. Weight optimization: fine-tune per query type
4. Experimental: combine best ideas

## Do NOT
- Add API calls or new dependencies
- Edit frozen files
- Spend 3+ iterations on the same dead approach
</code></pre>
<p>Then I walked away. Literally, I went and played with my kids.</p>
<h2>Round 1: The Results</h2>
<p>Baseline: <strong>0.6933</strong> composite (P@12: 0.6292, MRR: 0.9500)</p>
<p>Final: <strong>0.7200</strong> composite (P@12: 0.6500, MRR: 1.0000)</p>
<p>44 iterations. 3 kept. 41 reverted.</p>
<p>Let that sink in: <strong>93% of experiments made things worse or changed nothing.</strong></p>
<p>Here are the three changes that survived:</p>
<p><strong>1. Bigger base weights, scaled by query type.</strong> Location queries got 5x the original weight. Activity queries 3x. General 2x. The system had been under-weighting the keyword signals relative to the embedding similarity.</p>
<p><strong>2. Exponential scoring formula.</strong> Swapped <code>(1-d) * (1+boost)</code> for <code>(1-d) * exp(boost*0.3)</code>. Better separation between boosted and unboosted items. This also fixed the one query where MRR wasn't perfect.</p>
<p><strong>3. Higher general weights.</strong> Pushed 5x on the general query type weights specifically, which improved "best hikes in the world" from P@12 0.667 to 0.750.</p>
<p>None of these are surprising in hindsight. That's kind of the point.</p>
<h2>What Didn't Work (the Actually Useful Part)</h2>
<p>This is where the value is.</p>
<p><strong>Bigger candidate pools don't help.</strong> I expected that going from 100 to 150 or 200 re-ranking candidates would surface articles that were just barely outside the original pool. Nope. The expected articles were already in the top 100 by vector distance. The problem was ranking, not recall.</p>
<p><strong>Title matching is noise.</strong> Seemed like a slam dunk — articles with query terms in the title should rank higher, right? In practice, tons of <em>irrelevant</em> articles also have those terms in their titles. Net negative.</p>
<p><strong>Disabling adaptive weighting hurts.</strong> The correlation shrinkage I'd built in (reduce keyword weight when keywords correlate with embedding similarity) was actually pulling its weight. Removing it caused regressions.</p>
<p><strong>Keyword density scoring backfires.</strong> Normalizing keyword counts by article length seemed smart. It wasn't. Shorter articles aren't more relevant — they just have fewer words.</p>
<p><strong>Body keyword damping doesn't matter.</strong> Whether you use <code>1 + log1p(count) * 0.5</code> or <code>1 + log1p(count) * 0.3</code> or <code>min(count, 3)</code> — the scores barely move. The exact damping formula is not where the signal is.</p>
<p>Each of these would have been a reasonable thing to try manually. And each would have taken 15-30 minutes of "change, test, evaluate, think about it, decide." The agent burned through all of them in a few hours and proved definitively that they're dead ends.</p>
<p>And fiddling with just the weights would have taken forever if I even bothered going that far. Instead, we basically have LLM-led gradient descent:</p>
<p><em>[PR diff showing the tunings]</em></p>

<h2>Round 2: Optimizing the Prompt (and Finding Nothing)</h2>
<p>Round 1's final log said the ceiling was upstream — the quality of keywords extracted by Claude Haiku from the user's query. So I ran a second round targeting the Haiku prompt in <code>embedding_service.py</code>. Same test set, same metric, Round 1's ranking changes frozen.</p>
<p>16 iterations. Zero improvements. But two findings that were worth the entire round:</p>
<p><strong>The Redis trap.</strong> The metadata extraction function caches results by <code>hash(query)</code> — not <code>hash(query + prompt)</code>. My first two iterations showed improvements that weren't real. The eval was reading cached metadata from the old prompt. I only caught it when I cleared the cache manually and the "improvements" vanished. If you're running autoresearch on anything with a caching layer, make sure the cache key includes everything that could change between iterations.</p>
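<p>The fix is to key the cache on everything that can vary between iterations, prompt included. A sketch, with invented names:</p>
<pre><code class="language-python">import hashlib

def metadata_cache_key(query, prompt):
    """Hash query AND prompt, so editing the prompt invalidates old entries.

    The bug was keying on the query alone: a new prompt silently read
    metadata produced by the old one.
    """
    return hashlib.sha256(f"{prompt}\n---\n{query}".encode()).hexdigest()
</code></pre>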
<p><strong>The co-optimization ceiling.</strong> Round 1 tuned the ranking weights to work with the specific metadata distribution the original prompt produces. Changing the prompt changes that distribution, and the frozen ranking can't adapt. Every prompt change that improved location queries degraded activity queries. The two components were coupled, and optimizing them sequentially hit a wall that optimizing them together wouldn't have.</p>
<p>This is the thing nobody mentions about autoresearch: <strong>sequential rounds have a structural ceiling.</strong> Round 1 overfits to the current state of the frozen components. Round 2 can't improve those components without undoing Round 1's gains. If you're planning multi-round autoresearch, either co-optimize both components in one round, or know that each round's ceiling will be lower than the last.</p>
<h2>Where the Ceiling Is</h2>
<p>After 60 iterations across two rounds, the score settled at 0.72. The ranking math is near-optimal. The prompt is at a Pareto boundary. The remaining weak queries are ones where the right articles are far away in embedding space — "bike packing routes Pacific Northwest" returns road trip content because the embeddings think those are similar.</p>
<p>The next improvement needs a cross-encoder re-ranker or a better embedding model. That's a different project, not a different autoresearch run.</p>
<h2>Was It Worth It?</h2>
<p>For the ranking improvements alone? Probably not. A +0.03 on a composite score is real but marginal.</p>
<p>For the <em>knowledge</em>? Absolutely. I now know, with 60 data points, that my ranking logic was already close to optimal, that the adaptive weighting I built actually works, that keywords are essentially decorative in this system (the embeddings do all the real work), that the Redis cache doesn't key on prompt changes, and that my next improvement has to come from the embedding layer.</p>
<p>I would not have arrived at any of that from manual tuning. I would have tried 8-10 things, gotten frustrated, and moved on with lingering uncertainty.</p>
<p>The autoresearch pattern works best not when it finds big wins, but when it maps the ceiling of a system. "You can stop tuning this" is an underrated finding.</p>
<h2>If You Want to Try This</h2>
<p>You don't need GPUs. You don't need an ML training loop. You need:</p>
<ol>
<li><p><strong>One file</strong> the agent can edit</p>
</li>
<li><p><strong>One metric</strong> that goes up when things get better</p>
</li>
<li><p><strong>A fast eval</strong> (cache everything that doesn't change between iterations)</p>
</li>
<li><p><strong>An</strong> <code>instructions.md</code> that tells the agent the rules</p>
</li>
</ol>
<p>Write the eval first. Label some test data. Cache your API calls. Then let it run.</p>
<p>I've open-sourced the skill I built from this experiment as a Claude Code plugin: <a href="https://github.com/pjhoberman/autoresearch">pjhoberman/autoresearch</a>. It generates the full experiment harness (instructions, eval script, test data template, launch prompt) scoped to your codebase. The <code>references/lessons.md</code> file has everything I learned from both rounds.</p>
<p>The hard part isn't the loop. It's writing an eval that actually measures what you care about.</p>
]]></content:encoded></item><item><title><![CDATA[I Built a Local-First HSA Receipt Tracker with Flask, Google Drive, and AI]]></title><description><![CDATA[A tiny local python app that turns HSA receipts into structured records in Google Drive and Sheets in about five seconds.
Last year my family switched to a High Deductible Health Plan (HDHP) with a He]]></description><link>https://blog.pjhoberman.com/local-first-hsa-receipt-tracker-flask-ai</link><guid isPermaLink="true">https://blog.pjhoberman.com/local-first-hsa-receipt-tracker-flask-ai</guid><category><![CDATA[Python]]></category><category><![CDATA[automation]]></category><category><![CDATA[AI]]></category><category><![CDATA[Productivity]]></category><dc:creator><![CDATA[PJ Hoberman]]></dc:creator><pubDate>Mon, 09 Mar 2026 16:30:00 GMT</pubDate><content:encoded><![CDATA[<p>A tiny local python app that turns HSA receipts into structured records in Google Drive and Sheets in about five seconds.</p>
<p>Last year my family switched to a <strong>High Deductible Health Plan (HDHP)</strong> with a <strong>Health Savings Account (HSA)</strong>.</p>
<p>HSAs are unusual in the U.S. tax system because they’re <strong>triple tax advantaged</strong>:</p>
<ul>
<li><p>contributions are <strong>tax-deductible</strong></p>
</li>
<li><p>investments grow <strong>tax-free</strong></p>
</li>
<li><p>withdrawals for qualified medical expenses are <strong>tax-free</strong></p>
</li>
</ul>
<p>The Bogleheads wiki has a great explanation if you're curious:</p>
<p><a href="https://www.bogleheads.org/wiki/Health_savings_account">https://www.bogleheads.org/wiki/Health_savings_account</a></p>
<p>Because of those advantages, some people treat their HSA as a <strong>long-term investment account</strong> instead of reimbursing medical expenses right away. Almost everyone I tell about this has a 🤯 moment because they’ve never heard of it. So if you’re one of the lucky people learning about it today, you’re welcome!</p>
<blockquote>
<p>Your wallet loves this one weird trick!</p>
</blockquote>
<p>The strategy looks like this:</p>
<ol>
<li><p>Contribute to the HSA</p>
</li>
<li><p>Invest the money</p>
</li>
<li><p>Pay medical expenses out-of-pocket</p>
</li>
<li><p><strong>Save the receipts</strong></p>
</li>
<li><p>Reimburse yourself years (or decades) later, or don't</p>
</li>
<li><p>Profit</p>
</li>
</ol>
<p>There’s <strong>no deadline to reimburse yourself</strong> as long as:</p>
<ul>
<li><p>the expense happened after the HSA was opened</p>
</li>
<li><p>you kept documentation</p>
</li>
</ul>
<p>That last part is the problem.</p>
<p>Here’s what the result looks like after uploading a receipt:</p>
<img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jsez9q048vu4we210x0w.png" alt="HSA receipt tracker sheet" style="display:block;margin:0 auto" />

<p>Every receipt automatically becomes a structured record. Each row links back to the original receipt stored in Google Drive.</p>
<table>
<thead>
<tr>
<th>Date</th>
<th>Vendor</th>
<th>Amount</th>
<th>Receipt</th>
</tr>
</thead>
<tbody><tr>
<td>2026-02-14</td>
<td>Quest Diagnostics</td>
<td>$87.43</td>
<td>Drive Link</td>
</tr>
<tr>
<td>2026-02-03</td>
<td>Walgreens</td>
<td>$14.29</td>
<td>Drive Link</td>
</tr>
<tr>
<td>2026-01-19</td>
<td>Dentist</td>
<td>$120.00</td>
<td>Drive Link</td>
</tr>
</tbody></table>
<hr />
<h2>The Receipt Problem</h2>
<p>If you're saving receipts for potential reimbursement later, you need to keep track of:</p>
<ul>
<li><p>provider</p>
</li>
<li><p>date</p>
</li>
<li><p>amount</p>
</li>
<li><p>proof of payment</p>
</li>
<li><p>the original receipt</p>
</li>
</ul>
<p>Manually this usually turns into:</p>
<ul>
<li><p>PDFs scattered across downloads folders</p>
</li>
<li><p>random email attachments</p>
</li>
<li><p>a spreadsheet you forget to update</p>
</li>
</ul>
<p>I was doing this manually with Google Drive and a spreadsheet, and it wasn't <em>too</em> hard, but... I wanted something simpler.</p>
<p>Drop in a receipt → have everything filed automatically.</p>
<hr />
<h2>The Idea</h2>
<p>I built a tiny <strong>local-first web app</strong> that:</p>
<ol>
<li><p>accepts a medical receipt PDF</p>
</li>
<li><p>extracts useful fields automatically</p>
</li>
<li><p>stores the document in Google Drive</p>
</li>
<li><p>logs the expense in Google Sheets</p>
</li>
</ol>
<p>The whole flow takes about <strong>5 seconds of actual work</strong>.</p>
<hr />
<h2>Demo</h2>
<p>Here’s the entire workflow in real time (~5 seconds of actual work - the rest is me moving slowly for the video's sake):</p>
<p><a href="https://vimeo.com/1171880979?share=copy&amp;fl=sv&amp;fe=ci">Watch the demo on Vimeo</a>.</p>
<p>Upload a receipt → parsed → stored → logged.</p>
<hr />
<h2>The Stack</h2>
<p>The app is intentionally small:</p>
<ul>
<li><p><strong>Flask</strong> – local web app</p>
</li>
<li><p><strong>pdfplumber</strong> – extract text from receipts</p>
</li>
<li><p><strong>OpenAI (optional)</strong> – prefill receipt fields</p>
</li>
<li><p><strong>Google Drive API</strong> – store receipts</p>
</li>
<li><p><strong>Google Sheets API</strong> – expense log</p>
</li>
</ul>
<p>The workflow looks like this:</p>
<p><code>receipt → upload → text extraction → AI field parsing → Drive storage → Sheets entry</code></p>
<hr />
<h2>What Happens When You Upload a Receipt</h2>
<p>When I submit a receipt, the app:</p>
<ol>
<li><p>saves the uploaded PDF locally</p>
</li>
<li><p>computes a <strong>SHA-256 hash</strong> to detect duplicates</p>
</li>
<li><p>extracts text using <code>pdfplumber</code></p>
</li>
<li><p>optionally asks the model to extract fields like:</p>
</li>
</ol>
<ul>
<li><p>vendor</p>
</li>
<li><p>service date</p>
</li>
<li><p>amount</p>
</li>
<li><p>payment date</p>
</li>
<li><p>payment method</p>
</li>
<li><p>notes</p>
</li>
</ul>
<ol start="5">
<li><p>creates (or reuses) a Google Drive folder for the month:</p>
<p><code>2026-03/</code></p>
</li>
<li><p>uploads the receipt there</p>
</li>
<li><p>appends a row to Google Sheets with a link to the file</p>
</li>
</ol>
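<p>The filing step is deliberately dumb. A sketch of the folder naming and the row shape (column order here is illustrative):</p>
<pre><code class="language-python">from datetime import date

def month_folder(service_date):
    """Receipts are filed into one Drive folder per month, e.g. '2026-03'."""
    return service_date.strftime("%Y-%m")

def sheet_row(fields, drive_link):
    """The row appended to the Google Sheet."""
    return [fields["date"], fields["vendor"], fields["amount"], drive_link]
</code></pre>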
<p>Example spreadsheet row:</p>
<table>
<thead>
<tr>
<th>Date</th>
<th>Vendor</th>
<th>Amount</th>
<th>Receipt</th>
</tr>
</thead>
<tbody><tr>
<td>2026-02-14</td>
<td>Quest Diagnostics</td>
<td>$87.43</td>
<td>Drive Link</td>
</tr>
</tbody></table>
<p>Now every expense has:</p>
<ul>
<li><p>structured data</p>
</li>
<li><p>the original document</p>
</li>
<li><p>a searchable log</p>
</li>
</ul>
<hr />
<h2>One Small Feature That Ended Up Being Useful</h2>
<p>Duplicate detection.</p>
<p>Each uploaded file is hashed and stored in a local <code>receipt_hashes.json</code>.</p>
<p>If I accidentally upload the same receipt twice, the app can catch it before cluttering the spreadsheet or Drive folder.</p>
<p>Tiny detail, but it prevents a lot of mess.</p>
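<p>The whole mechanism fits in a few lines. Roughly (the JSON layout here is a guess at the shape, not the exact file format):</p>
<pre><code class="language-python">import hashlib
import json
from pathlib import Path

HASH_FILE = Path("receipt_hashes.json")

def is_duplicate(pdf_bytes):
    """Return True if this exact file has been uploaded before."""
    digest = hashlib.sha256(pdf_bytes).hexdigest()
    seen = set(json.loads(HASH_FILE.read_text())) if HASH_FILE.exists() else set()
    if digest in seen:
        return True
    seen.add(digest)
    HASH_FILE.write_text(json.dumps(sorted(seen)))
    return False
</code></pre>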
<hr />
<h2>Why Local-First?</h2>
<p>This could easily be a SaaS product.</p>
<p>But for personal admin tools, local-first is usually better:</p>
<ul>
<li><p>no extra account to maintain</p>
</li>
<li><p>easier to hack on</p>
</li>
<li><p>my files stay in my own Google account</p>
</li>
<li><p>no database or infrastructure to run</p>
</li>
</ul>
<p>I also set it up as a <strong>macOS</strong> <code>launchd</code> <strong>service</strong>, so it’s just always available on my laptop.</p>
<p>The goal wasn’t to build some big product. It was to remove just enough friction that I’d actually keep my receipts organized.</p>
<p>Now the workflow is:</p>
<p><code>Download receipt → Upload → Done</code></p>
<p>Which is about the amount of effort I’m realistically willing to spend on personal finance paperwork.</p>
<hr />
<h2>Repo</h2>
<p>If you're interested:</p>
<p><a href="https://github.com/pjhoberman/hsa-tracker">https://github.com/pjhoberman/hsa-tracker</a></p>
<hr />
<p>If you’re using the <strong>"save receipts and reimburse later"</strong> HSA strategy and built something similar, I’d love to hear about it.</p>
<p>Or if you have a better way to automate this - please tell me before I add OCR and accidentally build an entire product.</p>
]]></content:encoded></item><item><title><![CDATA[Using Claude to manage.. me]]></title><description><![CDATA[I have a problem.
I like to do a lot of things at once, and I like shiny objects. I can easily get caught up in some novel task, while tabbing through four docs I'm supposed to review, waiting for tes]]></description><link>https://blog.pjhoberman.com/using-claude-to-manage-my-priorities</link><guid isPermaLink="true">https://blog.pjhoberman.com/using-claude-to-manage-my-priorities</guid><category><![CDATA[AI]]></category><category><![CDATA[automation]]></category><category><![CDATA[Productivity]]></category><category><![CDATA[tooling]]></category><category><![CDATA[claude.ai]]></category><dc:creator><![CDATA[PJ Hoberman]]></dc:creator><pubDate>Fri, 22 Aug 2025 16:30:00 GMT</pubDate><content:encoded><![CDATA[<p>I have a problem.</p>
<p>I like to do a lot of things at once, and I like shiny objects. I can easily get caught up in some novel task, while tabbing through four docs I'm supposed to review, waiting for tests to pass on CI.</p>
<p>And I work across a few teams and projects.</p>
<p>And I like to learn new things.</p>
<p>And and and, sometimes I have trouble remembering what my priorities are.</p>
<p>Recently, I thought I'd try Claude to help manage me. My various priorities, projects, docs to review, PRs in flight, all of that.</p>
<h2>The Setup</h2>
<p>Using Claude Desktop, <a href="https://linear.app/docs/mcp">I hooked it up to Linear via MCP</a>. I <a href="https://www.anthropic.com/news/projects">started a project in Claude</a> just for this effort. And then I told it all the things I'm working on, with links where applicable.</p>
<p>I started off with project instructions:</p>
<blockquote>
<p>You help me remember what I'm working on, the work I owe people, etc. You should use the MCP connection to Linear to find tasks and evaluate my work and the work my team is doing as well.</p>
</blockquote>
<p>Then I told it all the docs I need to review, and asked it to look at Linear as well.</p>
<img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xlky7pc8co2i1yi2t1wg.jpg" alt="Screenshot from chat with Claude" style="display:block;margin:0 auto" />

<p>Then I told it about a side project and fed it a bunch of links related to it:</p>
<blockquote>
<p>I'm working on some side work called ChoreBot. Keep track of that as a separate project. As part of that work, I have these tabs open:</p>
</blockquote>
<ul>
<li><em>Imagine 15 tabs I had open</em></li>
</ul>
<p>I told it about other tasks, tabs, projects, and general work on my plate.</p>
<p>Finally, I had to summarize everything and create a document I could add as an artifact for the project.</p>
<p>Later, I worked with the project to give some prioritization to each task and project.</p>
<img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kwuz4jfww0ka4u925qqj.png" alt="Another conversation with Claude" style="display:block;margin:0 auto" />

<p>And again, I had Claude turn the results of this exercise into a document I could add to the project.</p>
<h2>The End Result</h2>
<p>Now, the benefit of this setup. When new tasks come up, I let Claude know. And for example, today, a Friday, nearing the end of the month, I can do something like this:</p>
<p>Me:</p>
<blockquote>
<p>It's Friday, I want to work on something fun but impactful. What's on the agenda?</p>
</blockquote>
<p>Claude:</p>
<blockquote>
<p>Let me check your current Linear tasks to see what's on your plate and then suggest something fun but impactful for your Friday! Perfect! Looking at your Linear tasks and comparing with your priority documents, I can see some great options for fun but impactful Friday work. Let me suggest a few based on what you have:</p>
</blockquote>
<blockquote>
<h3>🎯 <strong>Most Fun &amp; Impactful Options for Friday:</strong></h3>
</blockquote>
<blockquote>
<p><strong>1. ChoreBot Development</strong> (Your baby! 🤖)</p>
</blockquote>
<ul>
<li><p>This isn't in Linear yet but it's one of your August commitments</p>
</li>
<li><p>High impact AI DevOps automation - exactly the kind of cutting-edge work that's both fun and strategically valuable</p>
</li>
<li><p>You could work on the POC, maybe starting with the monitoring phase (Sentry/Datadog integration)</p>
</li>
</ul>
<blockquote>
<p><strong>2. DISCO-3746: AWS EventBridge Spam Detection</strong> (1 point, High Priority)</p>
</blockquote>
<ul>
<li><p>DevOps + AWS work - technical and satisfying</p>
</li>
<li><p>Setting up EventBridge rules for spam detection sounds like a fun infrastructure challenge</p>
</li>
<li><p>Quick win that unblocks the UGC team</p>
</li>
</ul>
<blockquote>
<p><strong>3. DISCO-3786: Redis Connection Pools</strong> (High Priority, no estimate)</p>
</blockquote>
<ul>
<li><p>Infrastructure optimization work - always satisfying to see performance improvements</p>
</li>
<li><p>Builds on your Redis expertise from the vector DB research</p>
</li>
<li><p>Could tie into your ChoreBot work later</p>
</li>
</ul>
<blockquote>
<h3>🤔 <strong>What I Notice:</strong></h3>
</blockquote>
<ul>
<li><p>You've been <strong>crushing</strong> the feed digest work! Tons of releases this week (DISCO-3791, 3790, 3787, etc.)</p>
</li>
<li><p>Your package update tasks (DISCO-3705, 3785) keep getting pushed - might be worth tackling one of these for a maintenance win</p>
</li>
<li><p>The cron monitoring script (DISCO-3779) is "In Progress" - could be a good Friday finish</p>
</li>
</ul>
<blockquote>
<h3>🚀 <strong>My Recommendation:</strong></h3>
</blockquote>
<blockquote>
<p><strong>Start ChoreBot POC development</strong> - it's not urgent deadline-wise, but it's your most strategic and fun project. You could:</p>
</blockquote>
<ol>
<li><p>Set up the basic monitoring integrations (Sentry/Datadog webhooks)</p>
</li>
<li><p>Create the Linear ticket auto-creation logic</p>
</li>
<li><p>Maybe even start on some MCP protocol research</p>
</li>
</ol>
<blockquote>
<p>This hits your "fun but impactful" criteria perfectly and aligns with your August commitments!</p>
</blockquote>
<blockquote>
<p>What sounds appealing to you? Want to dive into ChoreBot, or would you prefer one of the more immediate Linear tasks?</p>
</blockquote>
<h2>Conclusion</h2>
<p>We'll see how long I use this and whether it's the silver bullet I need to organize my sprawling task lists, but so far, I like it! It bridges the gap between a bunch of todo lists, Linear tasks, and what my former self thought each priority entailed.</p>
<p>What do you think? Have you used Claude for something like this before?</p>
]]></content:encoded></item><item><title><![CDATA[Changing how we do standups]]></title><description><![CDATA[This meeting needs a refactor.

Our team used to do the daily standup. Everyone talks about what they did yesterday, what they're doing today, and any blockers. Everyone tunes out except for their own]]></description><link>https://blog.pjhoberman.com/refactoring-the-daily-standup</link><guid isPermaLink="true">https://blog.pjhoberman.com/refactoring-the-daily-standup</guid><category><![CDATA[leadership]]></category><category><![CDATA[Culture]]></category><category><![CDATA[workplace]]></category><category><![CDATA[meetings]]></category><dc:creator><![CDATA[PJ Hoberman]]></dc:creator><pubDate>Mon, 11 Nov 2024 17:30:00 GMT</pubDate><content:encoded><![CDATA[<blockquote>
<p>This meeting needs a refactor.</p>
</blockquote>
<p>Our team used to do the daily standup. Everyone talks about what they did yesterday, what they're doing today, and any blockers. Everyone tunes out except for their own status update. The product manager and the tech lead pay some attention. Maybe once in a week or two, something interesting is said. 🥱</p>
<p>During a retro in the midst of a busy time for our team, we surfaced that we had too many meetings. And we're not alone in this. It was just too much with the looming deadlines. <a href="https://www.fastcompany.com/90888605/shopify-exec-this-is-what-happened-when-we-canceled-all-meetings">Shopify went extreme on dropping meetings a few years back</a>. The daily standup was one of many meetings we talked about, and everyone had the same feeling: <strong>this meeting needed a refactor</strong>.</p>
<p>We discussed what the actual intent of our daily standup was, and how we could preserve that while also reducing the pressure of meetings. Here's what we came up with:</p>
<p><strong>Daily Standup Goals</strong></p>
<ol>
<li><p>The product manager needs to <strong>know the status of certain tasks</strong> beyond what Linear reports. Not all tasks, not every day, but it's important for them to have their finger on the pulse.</p>
</li>
<li><p>The tech lead also needs to get a sense for people <strong>spinning their wheels</strong>, working on things for longer than expected, and how <strong>projects are moving along</strong>.</p>
</li>
<li><p><strong>People need help getting unblocked</strong>. A topic for another time is how long to try to unblock yourself before asking for help. But, our team is generally good at reaching out outside of meetings. However, sometimes we need a dedicated time and place to make sure a block is in front of everyone.</p>
</li>
<li><p>We're a fully remote team. We need <strong>face time</strong>. We need to bond and build / maintain team <strong>culture</strong>. But we don't want more meetings.</p>
</li>
</ol>
<p><strong>Our Team's Solution</strong></p>
<ol>
<li><p>Daily SlackBot async standup. A bot sends a message to our team handle in slack, and we respond with <strong>Y</strong>esterday, <strong>T</strong>oday, and <strong>B</strong>lockers. Sometimes it's just a status update, sometimes it's a funny quip about our kids being blockers. Conversation sometimes ensues. Other times, it doesn't. And that's perfect.</p>
</li>
<li><p>On Mondays, Wednesdays, and Fridays, we have a 15 minute team sync. This is meant to be a chat about what we're working on that's interesting. Blockers that we want to dive deep on or just vent. And sometimes it's just a coffee chat, and we talk about plans for the weekend or some cool new toy.</p>
</li>
</ol>
<p>This dropped a mere 30 minutes of meetings from the week, but we don't dread the remaining 45 minutes anymore. The time is useful for everyone.</p>
<p>All our goals are met, we have fewer standup meetings, and the meetings we do have show much higher engagement and enjoyment. Win win win!</p>
<p>What do you think? Do you find the daily stand up useful? Who is it for in your team? How do you build and reinforce culture on your remote team?</p>
]]></content:encoded></item><item><title><![CDATA[Finding Unused Indexes in Postgres]]></title><description><![CDATA[Database indexes are incredibly important in production systems. Single column ones are obvious, but multi-column (composite) indexes are game-changers for speeding up commonly used queries by several]]></description><link>https://blog.pjhoberman.com/find-unused-indexes-postgres</link><guid isPermaLink="true">https://blog.pjhoberman.com/find-unused-indexes-postgres</guid><category><![CDATA[PostgreSQL]]></category><category><![CDATA[backend]]></category><category><![CDATA[indexing]]></category><dc:creator><![CDATA[PJ Hoberman]]></dc:creator><pubDate>Wed, 18 Sep 2024 16:30:00 GMT</pubDate><content:encoded><![CDATA[<p>Database indexes are incredibly important in production systems. Single column ones are obvious, but multi-column (composite) indexes are game-changers for speeding up commonly used queries by several orders of magnitude.</p>
<p>But sometimes, we (I) create a few different indexes over time, and it's not always obvious when to remove old indexes. Indexes take up space and slow down writes, so it's important to monitor and clean them up periodically.</p>
<p>Here is a Postgres query to give you some insight:</p>
<pre><code class="language-sql">SELECT 
    relname AS table_name, 
    indexrelname AS index_name, 
    idx_scan AS index_scans, 
    idx_tup_read AS tuples_read, 
    idx_tup_fetch AS tuples_fetched
FROM 
    pg_stat_user_indexes
JOIN 
    pg_index ON pg_stat_user_indexes.indexrelid = pg_index.indexrelid
WHERE 
    schemaname = 'public'
ORDER BY 
    idx_scan DESC;
</code></pre>
<p>The output lists each table name and index name, along with usage data: how many times the index was scanned by a query, how many tuples (index rows, in this case) were read from it, and how many tuples were actually fetched after all filtering was complete.</p>
<p>These statistics accumulate from whenever the server was last restarted or the statistics were last reset. Here is some example output from a production server. I removed a bunch of rows to show heavily used indexes next to some that aren't used at all:</p>
<pre><code class="language-plaintext"> table_name    |          index_name         | index_scans | tuples_read | tuples_fetched 
---------------+-----------------------------+-------------+-------------+----------------
 items         | items_pkey                  | 17566068467 | 22444742841 | 21762928697
 routes        | routes_item_id_key          |  4046022477 |  2541792837 | 2521785009
 items         | items_url_idx               |  1520426292 |  7556543480 | 1518612148
 authors       | authors_pkey                |   211481111 |    45577051 |   42726045
 logs          | logs_type_coord_uniq        |     6437114 |     1462603 |     1392484
 spatial_ref   | spatial_ref_pkey            |     2060726 |    13792886 |     2056566
 users         | users_pkey                  |     1872578 |     2214935 |     1872578
 ...
 rate_limits   | rate_limit_key_like         |           0 |           0 |           0
 blocks        | blocks_pkey                 |           0 |           0 |           0
 blocks        | blocks_uniq                 |           0 |           0 |           0
</code></pre>
<p>As you can see, some of these indexes are never used! I have some work ahead of me now: manually review these indexes and potentially remove unused ones to save on space and improve write performance. Eventually, I'd like to set up automated alerts to let me know that some indexes aren't being used at all.</p>
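<p>The filtering half of that automation is simple. A sketch that takes rows shaped like the query output above (fetched with any Postgres driver) and flags never-scanned indexes — the row shape and threshold here are my own choices, not a finished alerting system:</p>
<pre><code class="language-python3">def unused_indexes(rows, max_scans=0):
    """rows: (table_name, index_name, index_scans, tuples_read, tuples_fetched).
    Return (table, index) pairs scanned at most max_scans times."""
    return [(table, index)
            for table, index, scans, _read, _fetched in rows
            if scans <= max_scans]

# Feed this the results of the pg_stat_user_indexes query on a schedule,
# then send any hits to Slack or email for manual review before dropping.
</code></pre>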
<p>Let me know if this is helpful for you or if you've gone further and automated anything like this!</p>
]]></content:encoded></item><item><title><![CDATA[LaunchDarkly across multiple celery tasks]]></title><description><![CDATA[We were running into an issue recently where LaunchDarkly wasn't evaluating on celery servers. Opening a python shell on the boxes showed the keys were setup correctly, and that if we called the metho]]></description><link>https://blog.pjhoberman.com/launchdarkly-celery-fork-worker-init</link><guid isPermaLink="true">https://blog.pjhoberman.com/launchdarkly-celery-fork-worker-init</guid><category><![CDATA[Python]]></category><category><![CDATA[backend]]></category><category><![CDATA[celery]]></category><category><![CDATA[launchdarkly]]></category><dc:creator><![CDATA[PJ Hoberman]]></dc:creator><pubDate>Fri, 21 Jul 2023 16:30:00 GMT</pubDate><content:encoded><![CDATA[<p>We were running into an issue recently where LaunchDarkly flags weren't evaluating on our Celery servers. Opening a Python shell on the boxes showed the keys were set up correctly, and that if we called the methods directly, the flags evaluated as expected. However, when called as a delayed task, the flags weren't evaluating.</p>
<p>Celery makes use of <code>fork</code> in Python to spawn its workers, and LaunchDarkly requires that only one client instance exists per process. <a href="https://stonesoupprogramming.com/2017/08/26/python-basic-forking/">This post</a> is a good primer on forking in Python. It appeared that this combination was the issue.</p>
<p>My coworker Doug explained it thusly:</p>
<blockquote>
<p>The LaunchDarkly library makes use of threading, Celery starts the main "control" process that uses <code>fork()</code> to start <code>n</code> workers, based on your concurrency setting.</p>
</blockquote>
<blockquote>
<p>Forking copies the current process into a new one, but in Python it kills off all but the thread doing the forking. All others are stopped. So the threads that LaunchDarkly starts up during initialization (e.g., EventDispatcher or StreamingUpdateProcessor) end up "defunct" or unpredictable.</p>
</blockquote>
<blockquote>
<p>The Locks used within LaunchDarkly are thread independent, but because threads are killed off in the child, you end up with an invalid state and can’t trust things will work.</p>
</blockquote>
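<p>Doug's point is easy to demonstrate. A minimal sketch (POSIX-only, using <code>os.fork</code> directly rather than Celery) showing that a background thread does not survive into the forked child:</p>
<pre><code class="language-python3">import os
import threading


def thread_counts_after_fork():
    """Start a background thread, fork, and report active thread counts."""
    stop = threading.Event()
    threading.Thread(target=stop.wait, daemon=True).start()
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: only the thread that called fork() was copied, so the
        # background thread (think LaunchDarkly's StreamingUpdateProcessor)
        # simply doesn't exist here.
        os.write(w, str(threading.active_count()).encode())
        os._exit(0)
    os.close(w)
    child_count = int(os.read(r, 16))
    os.close(r)
    os.waitpid(pid, 0)
    parent_count = threading.active_count()  # main thread + background thread
    stop.set()
    return parent_count, child_count
</code></pre>
<p>The parent reports at least two threads; the child reports one. Any state those missing threads were maintaining is silently stale in the worker.</p>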
<p>Further, <a href="https://docs.launchdarkly.com/sdk/server-side/python#configuring-uwsgi">LaunchDarkly recommends a post fork hook to initialize the client</a>.</p>
<pre><code class="language-python3">import ldclient
from ldclient.config import Config
import uwsgidecorators

@uwsgidecorators.postfork
def post_fork_client_initialization():
    ldclient.set_config(Config("sdk-key-123abc"))
    client = ldclient.get()
</code></pre>
<p>However, our Django application uses <code>asgi</code>, which doesn't currently have this hook. This is our current LaunchDarkly configuration <code>launch_darkly.py</code>:</p>
<pre><code class="language-python3">import atexit
import sys

from django.conf import settings

"""
Sets up Launch Darkly for use across the site.
LD is already initialized. See discovery_service.views.ld_check for example usage.
"""


class LDClient():
    def __getattr__(self, v):
        if 'ldclient' in sys.modules:
            # ldclient was already imported (and configured) in this process
            import ldclient
        else:
            # first use in this process: import and configure exactly once
            import ldclient
            from ldclient.config import Config
            ldclient.set_config(Config(settings.LAUNCH_DARKLY_SDK_KEY))
        return getattr(ldclient.get(), v)


ld_client = LDClient()

@atexit.register
def close_ld(*args, **kwargs):
    # LD recommends closing upon app shutdown
    # https://docs.launchdarkly.com/sdk/server-side/python
    ld_client.close()
</code></pre>
<p>The <code>LDClient</code> class allows us to ignore new instantiations of the <code>ldclient</code> library if it's already been loaded.</p>
<p>And the general use is:</p>
<pre><code class="language-python3">from launch_darkly import ld_client

def flagged_code():
    flag = ld_client.variation("flag-name", {"key": 12345}, False)  # False is the default in this case
    if flag:
        ...  # do something if the flag is on
    else:
        ...  # do something else if the flag is off
</code></pre>
<p>After a lot of bashing through walls, aka iterative development, we discovered two things:</p>
<h5>1. Module Level instantiation</h5>
<p>There was a module-level instantiation of the LaunchDarkly client that was causing the library to initialize before the fork.</p>
<p>Basically, the above code, but instead:</p>
<pre><code class="language-python3">from launch_darkly import ld_client

# Module level: this runs at import time, before Celery forks its workers
flag = ld_client.variation("flag-name", {"key": 12345}, False)      # False is the default in this case


def flagged_code():
    if flag:
        ...  # do something if the flag is on
    else:
        ...  # do something else if the flag is off
</code></pre>
<p>So that code was removed / refactored.</p>
<h5>2. Celery initialization</h5>
<p>In our <code>celery.py</code> code, we added a <code>worker_process_init</code> hook to initialize the library properly. This ensures that when the Celery workers fork, there is definitely an LD client ready to go for any code that requires it.</p>
<pre><code class="language-python3">import logging

from celery.signals import worker_process_init

from launch_darkly import ld_client

logger = logging.getLogger(__name__)


@worker_process_init.connect
def configure_worker(signal=None, sender=None, **kwargs):
    """Initialize the LaunchDarkly client for use in Celery tasks."""
    try:
        res = ld_client.variation("test-flag", {"key": 0}, 0)
        logger.info(f"LD client initialized for Celery worker. {res}")
    except Exception:
        logger.error("Error initializing LD client for Celery worker.", exc_info=True)
</code></pre>
<p>To aid in future discovery and debugging, we also created a celery task that we can call on the fly to make sure things are working:</p>
<pre><code class="language-python3">
@shared_task
def celery_ld_check(flag="test-flag", key=0, default="not found"):
    """
    Test LaunchDarkly SDK connectivity from Celery.
    """

    print("trying celery_ld_check")
    try:
        variation = ld_client.variation(flag, {"key": key}, default)
        print(f"celery_ld_check: {variation}")
    except Exception as e:
        print(f"celery_ld_check: {e}")
</code></pre>
<p>Lastly, we will likely iterate on the <code>LDClient</code> class to deal with issues regarding the fork on the fly.</p>
<p>Let me know if this helps you in your code, or sparks any ideas for you!</p>
]]></content:encoded></item></channel></rss>