Yet Another Blog Migration!

I’m moving all my blog content from Medium and WordPress. What better way to start than by describing the process?

Tales from the gaming industry for folks in tech

After a year at Singularity 6 working on Palia, I found myself answering the same questions from friends in tech: What’s different about gaming? What’s the same? This post shares many of the similarities and differences I experienced compared to other tech jobs.

Ubuntu 25 setup for a 2014 Macbook Pro

Apple stopped providing OS updates for my old MacBook Pro (Retina, 13-inch, Mid 2014, Intel CPU) even though the hardware is good enough for common web and development tasks. Eventually Chrome stopped providing updates for the OS version too. Then Docker stopped providing updates for it. Even DuckDB (a Python...

On retrospectives

I’ve had a number of conversations about retrospectives over the past few months, and until now I didn’t have anything written that I could share.

MLOps repo walkthrough

There’s a big difference between building a machine learning model that works on your computer, and making that model available for others to use. If implemented poorly, your users will be frustrated that your software isn’t reliable. And it can take months to implement it well!

MLOps Design Principles

There’s a big difference between building a machine learning model that works on your computer, and making that model available for others to use. If implemented poorly, your users will be frustrated that your software isn’t reliable. And it can take months to implement it well!

Machine translation for medical chat, checkpoint #4

This series of posts is about building trustworthy machine translation for medical chat. Multilingual doctors are rare and not all patients in the US speak English. However, just using machine translation isn’t enough; physicians often have concerns about safety and trust.

Lessons learned in 2022

At the end of each year, I like to reflect on my career and life. This post is meant to celebrate learning and growth, whether learning something new, or changing my mind as I gain more experience. I hope you’ll also take time to notice and celebrate your own growth...

Machine translation for medical chat, checkpoint #1

At my previous job, we provided primary care in a text chat between doctors and patients. It was on-demand, meaning that patients could show up anytime and get in line to chat with a physician. Occasionally we had challenges when a patient didn’t speak much English, for instance if they...

Machine translation class notes

I’d like to build a machine translation system for English-Spanish medical chat. But first, I need to brush up on machine translation.

Surprises in becoming an engineering manager

Recently I talked with a few folks who were curious about transitioning into software engineering management, and we discussed what it was like when I made that transition about 4 years ago. I’ll summarize most of those conversations in this post. I’ll focus on the surprises I found in...

Employee retention in Python

This post shows the math of employee retention, including sample code in Python that you can use for your own team.
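A minimal sketch of that kind of retention math; the function names and the simple geometric-tenure model here are my own, not necessarily the post’s exact code:

```python
def retention_rate(headcount_start, departures):
    """Fraction of the starting team still present at the end of the period."""
    return (headcount_start - departures) / headcount_start

def annualize(monthly_retention):
    """Compound a monthly retention rate over 12 months."""
    return monthly_retention ** 12

def expected_tenure_years(annual_retention):
    """Expected tenure under a constant annual attrition rate (geometric model)."""
    return 1.0 / (1.0 - annual_retention)

# Example: a 20-person team losing 1 person in a month
monthly = retention_rate(20, 1)          # 0.95
annual = annualize(monthly)              # ~0.54 retained after a year
tenure = expected_tenure_years(annual)   # ~2.2 years expected tenure
```

Even a modest-looking 5% monthly loss compounds to losing nearly half the team in a year, which is the kind of arithmetic the post walks through.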

Cloudwatch Custom Metrics in a Python Lambda

My team is responsible for developing, maintaining, and operating several web services that host our machine learning models. Periodically we need quick ways to check that our code is operating correctly. For example, a user may tell us that the software did something weird and we need to figure out...
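As a rough illustration of custom metrics, here’s the shape of a CloudWatch MetricData entry; the metric name, namespace, and dimensions below are made up for this sketch:

```python
def build_metric_datum(name, value, unit="Count", dimensions=None):
    """Build one entry in the MetricData list that
    CloudWatch's put_metric_data call expects."""
    datum = {"MetricName": name, "Value": float(value), "Unit": unit}
    if dimensions:
        datum["Dimensions"] = [
            {"Name": k, "Value": v} for k, v in sorted(dimensions.items())
        ]
    return datum

datum = build_metric_datum(
    "PredictionLatencyMs", 42.0, unit="Milliseconds",
    dimensions={"Model": "example-model"},
)

# Inside the Lambda handler, publishing would look roughly like:
#   import boto3
#   boto3.client("cloudwatch").put_metric_data(
#       Namespace="MyService", MetricData=[datum])
```

The dimensions are what let you slice the metric later, e.g. latency per model, when you’re digging into a user report.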

Localization in Swype

I stumbled into a conversation about localization the other day and thought back to my time at Swype and Nuance, wishing I had an article about the challenges we faced. For some context, Swype was a keyboard app for Android that was acquired by Nuance in 2011 and continued to...

Is language modeling just RNNs now?

Language modeling is as much “just RNNs” as self driving cars are “just ConvNets.” It’s all the bits that you build on top of your function approximator, whether that’s an ngram model or recurrent neural network.

Tips for effective data science talks

In grad school I learned to practice my presentations before giving them. Teaching classes further reinforced the importance of communication, and that practice has served me well in industry.

So you're new to startups…

Periodically I help a friend transition from the medical or legal field to a tech startup. So I send them this list; even though it’s nowhere near complete, it’s enough that they can start learning on their own and ask really good questions.

Tuning dropout for each network size

In the previous post I tested a range of shallow networks from 50 hidden units to 1000. On the smaller dataset (50k rows) additional network complexity hurts: It’s just overfitting. On the larger dataset (200k rows) the additional complexity helps because the amount of data prevents the network from overfitting.

Switching from deep to wide

In the previous post I found gains by adding a second hidden layer. But I accidentally found even better results with wider networks of a single hidden layer. I’ve done more systematic experimentation and wanted to share. Just as a reminder this is a part of my ongoing project to predict the winner...

Gains from deep learning

Back from the holidays! I’ve finally made some progress with neural networks, particularly a deep network. This is a part of my ongoing project to predict the winner of ranked matches in League of Legends based on information from champion select and the player histories. Previously I’d been working on ensemble...

Ensembles part 2

I’ve been using ensembles of my best classifiers to slightly improve accuracy at predicting League of Legends winners. Previously I tried scikit-learn’s VotingClassifier and also experimented with probability calibration.

Ensemble notes

I thought probability calibration would be difficult but it’s pretty easy. My ensemble code looks like this:
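The snippet itself isn’t preserved in this excerpt; as a stand-in, the core of soft voting over calibrated models is just a weighted average of their probability estimates. A minimal pure-Python sketch (names and numbers my own, not the original code):

```python
def soft_vote(prob_lists, weights=None):
    """Average class-probability vectors from several calibrated models.

    prob_lists: one probability vector per model, e.g. [P(loss), P(win)].
    """
    if weights is None:
        weights = [1.0] * len(prob_lists)
    total = sum(weights)
    n_classes = len(prob_lists[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, prob_lists)) / total
        for c in range(n_classes)
    ]

# Three models' calibrated [P(loss), P(win)] estimates for one match:
ensemble = soft_vote([[0.4, 0.6], [0.3, 0.7], [0.45, 0.55]])
# ensemble ≈ [0.383, 0.617], so the ensemble predicts a win
```

This averaging only behaves well when each model’s probabilities are calibrated, which is why calibration comes up alongside scikit-learn’s VotingClassifier in these posts.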

Better predictions for League matches

I’m predicting the winner of League of Legends ranked games with machine learning. The models look at player histories, champions picked, player ranks, blue vs red side, solo vs team queue, etc. The last time I wrote about accuracy improvements my best was 61.2% accuracy with gradient boosting trees.

Feature scaling is important, but not how I expected

Currently I’m getting up to speed with the Keras library for neural networks. After about a day and a half of effort I have a neural network that’s tied with my best results ever for predicting the winner of League of Legends matches.

Bigger League of Legends data set

Riot granted me an app key so I can crawl a lot more data. The downside is that I had to re-engineer much of my system because I couldn’t use the free MongoLab tier with that much data. To give a ballpark sense, my mongo data directory is 46 GB for 1.8...

Predicting League match outcomes: Week 2

I’ve continued to log my experimental accuracy in predicting League of Legends matches (see part 1) and this graph picks up from where I left off last time (around 64% accuracy).

Predicting League match outcomes: Gathering data

I’d like to take the results of the pick/ban phase of professional League of Legends matches and compute the probability of winning. In part I find it interesting to watch analysis of the pick/ban phase by regular casters or Saint’s VOD reviews. How much of the game is really determined at pick/ban...

Getting users via Reddit

It’s tempting to focus purely on the engineering or research of a project. Hmm tempting isn’t the right word… it’s the default approach. In a typical software engineering or research job, you’re trained to leave other aspects of the project to marketing/business/etc.

Question processing for factoid search

Searchify was a project to enable quick factoid lookup on mobile advertisements. A full screen ad would have a search box and you could get quick answers without leaving the ad or even the app. Previously I’ve written about building synonyms for automotive in Searchify.

Synonyms for factoid search, Part 3

In the previous two posts I described 1) our problem and initial simple approaches and 2) WordNet-based solutions. Now I’m finally writing up our best solutions: gathering a domain-specific corpus and learning word associations.

Trends over a season of TV shows

If I plot the number of downloads per episode over the length of an anime series, are there interesting trends? I’m using the data and estimation methods from Over 9000 and graphing the number of downloads for each episode 7 days from when the torrent is available.

Finding a learning curve for Over 9000

For Over 9000 I’m estimating the number of torrent downloads per show/episode at 7 days from release. If I have enough data I can compute that by interpolating points. But usually I need to extrapolate from the first few days of downloads.
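One simple family of curves for this kind of extrapolation is a power law, which has a closed-form least-squares fit in log-log space. This is only an illustrative sketch with made-up numbers, not necessarily the curve the post settles on:

```python
import math

def fit_power_law(days, downloads):
    """Least-squares fit of downloads ≈ c * days**k in log-log space."""
    xs = [math.log(d) for d in days]
    ys = [math.log(v) for v in downloads]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Slope of the log-log regression line is the exponent k.
    k = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    c = math.exp(my - k * mx)
    return c, k

def extrapolate(c, k, day):
    return c * day ** k

# Fit on the first three days of (hypothetical) downloads, project day 7:
c, k = fit_power_law([1, 2, 3], [1000, 1400, 1720])
day7 = extrapolate(c, k, 7)
```

The appeal is that you only need a couple of early points to project the day-7 number, at the cost of assuming the curve shape up front.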

Curve fitting and machine learning for Over 9000

One of my current projects is Over 9000, a visualization that shows which anime series are currently popular. I get the data by scraping a popular anime torrent site every day and come up with a single number that represents the popularity of a show.

Synonyms for factoid search, Part 2

The previous post described the problem and attempted to find synonyms for Elastic Search using a thesaurus, Wikipedia redirect groups, and Bing’s related searches.

Synonyms for factoid search: Part 1

A while back I worked on a potential startup project called Searchify and needed to generate domain-specific synonyms. I’m finally getting around to writing that up but there may be some holes in my memory or notes.

REST API design tips

For the server side of Pollable we iterated on the REST API several times and learned a lot the hard way. And this article is excellent for design tips.

Form validation on Mechanical Turk

Amazon’s Mechanical Turk is a great system for surveys, writing short snippets, tagging images, and other small tasks. But you spend your time on the littlest things.

Mechanical Turk tips for beginners

I’ve been using Amazon’s Mechanical Turk for side projects, whether it’s for annotating data, generating short-form content, or evaluating subjective quality.  When I started I had several misconceptions, found a lot of great info, and learned many lessons the hard way.

Projecting the number of downloads for torrents

One of my current projects is Over 9000, a visualization that shows which anime series are currently popular. I get the data by scraping a popular anime torrent site every day and come up with a single number that represents the popularity of a show.

Thoughts on Python

Python is a great language: It’s simple and popular enough to have excellent APIs. A couple years ago I switched over from mostly Perl-based scientific code to Python. Since then I’ve been mostly in Python with some digressions in Java/Hadoop and C. Although Python is popular for web frameworks, my usage is...

Tips on moving from grad school to industry

After finishing my PhD and teaching for a semester, I joined Swype as a Software Engineer. My focus was on researching and developing language modeling improvements; it wasn’t just coding away.

ParseTreeApplication on Github

It’s tougher to blog or publish articles in a professional research and development role; most of the work isn’t public. If I’m not careful, side projects get the “leftover” time in my schedule.

cpu temps + dust = ???

Recently I built a new system for myself and as usual I bought a tube of thermal paste.  Now what do I do with it?  Typically I let it sit in the closet for another 3 years, then wonder whether it loses its magic, then order a new tube anyway....

computational complexity + ??? = reality

This is an opinion piece on computational complexity or Big-O notation. There are two sides that I’ve experienced - teaching and software development. To some extent they’re tied - teaching should prepare students for real-world use of complexity analysis. I’ll start with software development then transition to teaching. I’ve adapted...

cv + latex/cs + phd = ???

So you’re finishing up your PhD (or thinking about jobs) and you need to make a CV.  I remember this situation and I felt lost.  Here are some notes for beginners:

storing a word list

It’s been a while; my apologies. Teaching combined with job hunting and (attempted) research takes more than I expected. I suppose that makes me an optimist?

cv stuff

If using LaTeX for your vita, it’s a good idea to put each of the sections in different files and include them with \input commands. The advantage is that now you can have several versions of the same CV linked to the same underlying data. For example, a teaching CV would...

Switchboard stats

I’ve been using the Switchboard corpus for years and I recently gave a talk in class with some statistics, including simple tests to compare Switchboard to a background corpus. In this case, I used Google’s Web 1T unigram model for “general purpose statistics”.

ambitions

I was re-reading The Bourne Supremacy recently and this quote struck me:

Sarah Connor Chronicles from an AI Perspective

A bit of a departure from my usual, but I thought it’d be interesting to review Terminator:  Sarah Connor Chronicles from the perspective of an AI researcher. I’ll go through the first season in this post and the second season at another time.

comments: Technology, Conferences, and Community

Jonathan Grudin recently published an article on conference/journal culture in computer science entitled Technology, Conferences, and Community in Communications of the ACM. Unfortunately it’s behind a paywall, but I’m sure there’s fulltext out there via Google or Google Scholar. I’ll give a (hopefully short) summary and then some commentary.

Wolfram on Watson, etc

As the semester approaches my free time and energy have been dwindling, so I’ll probably be terse for a while.

viterbi search for re-capitalization

Last week I mentioned that I was working to restore the capitalization of graph titles.  At the time, I was using Google’s trillion word unigram model to disambiguate case.  But it didn’t work quite right.  It seemed like an improvement over just lowercasing them, but you’d end up with words...
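The idea is to treat each word’s case variants as a lattice and let Viterbi pick the best-scoring path using context rather than unigram frequency alone. Here’s a toy sketch with made-up counts (not Google’s actual model, and not the post’s original code):

```python
import math

# Toy counts standing in for a real corpus (invented for illustration).
UNIGRAMS = {"new": 50, "New": 30, "york": 2, "York": 40, "city": 60, "City": 25}
BIGRAMS = {("New", "York"): 25, ("York", "City"): 10, ("new", "york"): 1}

def variants(word):
    """Case variants of a lowercased input word."""
    seen = []
    for v in (word.lower(), word.capitalize(), word.upper()):
        if v not in seen:
            seen.append(v)
    return seen

def score(prev, cur):
    """Log score mixing a bigram count with a unigram backoff."""
    big = BIGRAMS.get((prev, cur), 0)
    uni = UNIGRAMS.get(cur, 0)
    return math.log(1 + big * 10 + uni)

def truecase(words):
    """Viterbi over case variants: best[v] = (log score, best path ending in v)."""
    best = {v: (math.log(1 + UNIGRAMS.get(v, 0)), [v]) for v in variants(words[0])}
    for word in words[1:]:
        nxt = {}
        for v in variants(word):
            nxt[v] = max(
                (s + score(p, v), path + [v]) for p, (s, path) in best.items()
            )
        best = nxt
    return max(best.values())[1]

print(truecase(["new", "york", "city"]))  # → ['New', 'York', 'City']
```

With unigrams alone, "new" outvotes "New" here; it’s the bigram context from "York" that pulls the whole phrase into title case, which is the failure mode the unigram-only approach ran into.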

Improving our Reviewing Processes

It looks like the new issue of CL Journal is out today, and Inderjeet Mani has an interesting Last Words article about reviewing.  It’s only 4 pages, but I’ll summarize it and then comment.

title case + pos tagging = nnp

This year I plan to make up for lost time in publication; while finishing my dissertation I was still doing research but didn’t get around to publishing it.  Now I’m in the process of doing that and I thought I’d also take a look at some work from my early...

IBM's Watson

In case you haven’t heard of it yet, IBM developed a deep QA system called Watson.  They’ve had a medium-sized team working on it for about 4 years.  Among others is Jennifer Chu-Carroll, who received her Ph.D. from the University of Delaware under Sandee Carberry (who was part of my...

parse trees + visualization = ???

There have been many points in my research career when I realized that I would understand the problem better if someone had visualized my data somehow.  Usually, that means I write some simple visualization that’s close enough to quickly answer my questions about the data.  For example, I’ve compared stemmers/lemmatizers...

it's 2011 and

unrealistic, unexpected unprogress

skepticism of scientific findings

The New Yorker has an interesting piece on reproducibility in science, citing several biomedical studies that showed smaller and smaller effects over time in re-testing.  They also describe tests run by multiple groups, where some found significance and some did not.  Some researchers found huge effects and some did not.

word scramble problem

Sorry for the delay in posting something — I’ll be more active again once I get over this cold.  Here’s something from my Drafts folder in the meantime.

google books ngram viewer (part 2)

Last week, I covered the release of the Google Books NGram Viewer.  Google’s handy tool has received a lot of attention since then, which I’ll attempt to summarize and I’ll add some more information.

google books ngram viewer

Today, Google released the Google Books NGram Viewer, which is a beautiful frontend to a historical ngram model.  They have a separate ngram model for each year and for each language type (English, American English, British English, Simplified Chinese, etc).

word prediction + ??? = google scribe

There’s a South Park episode that spawned the quote “Simpsons did it.”  In research, I’m starting to feel like we could say “Google did it.”  I’m talking about Google Scribe, which brings word prediction (my research area) to the masses.

unlearning how to read and write

We’ve been reading for so long that it’s a very internal process.  For example, while I’m reading I might not realize that I’m struggling to read a long string of prepositional phrases;  I might just have the vague feeling that it’s difficult.  Fortunately, with conscious effort, it becomes easier and...

what a phd means...

I thought it might be interesting to make a (semi-funny) bulleted list about what a Ph.D. is.

style-check.rb

I’m very interested in tools to help write papers, specifically automatic proofreaders.  I recently came across the tool style-check.rb, which helps proofread LaTeX documents.  It doesn’t really have a name per se, just the filename style-check.rb.

nlp and statistical significance

I wanted to jot some notes on statistical significance as a follow-up to the evaluation post.  But I tried to write for a wider audience, and it ended up being huge and incomplete.  In this post I’ll describe some of the oddities of using statistical significance for NLP research, and...

evaluation in NLP

Evaluation is the concrete specification of your research goals.

brief history of ASR

This is the first part of a two-part article on evaluation in natural language processing.  For the first part, I’d like to focus on Fred Jelinek’s ACL Lifetime Achievement Award, which has a corresponding publication in Computational Linguistics entitled ACL Lifetime Achievement Award: The Dawn of Statistical ASR and MT.

latexdiff + version control = ???

In the past, I’ve used latexdiff to show the changes between different versions of a paper.  But it’s a pain to keep the old version around.  Or if I’m using a version control system I have to remember the right version number (and remember the commands to retrieve old versions...

comparing stemmers

Natural language understanding typically involves a huge chain of processing and errors can happen anywhere, then get propagated to your system which operates at the top of the stack.  I’ll give an example from my thesis work - adapting a language model to the style of text:

T-2 days until defense

I remember my first conference presentation.  It was at ASSETS in Tempe, Az.  2007?  My voice was wavering because I was so nervous doing my first public talk in front of maybe a hundred people.  I have a low voice so the audience couldn’t tell (or they were really polite)....

the words argument...

I came across an opinion story at The Chronicle today about how we tend to focus on war more than peace.  I read the Language Log often and they cringe at the pop media “words arguments”:  language X has more words for Y.  Notably, the Eskimo snow argument crops up often....

latex tips and tricks

LaTeX takes some getting used to, but now that I’m used to it I (usually) love it.  Part of the initial difficulty lies in understanding the syntax and how it affects typesetting, which I won’t discuss in this post.  Some of that is better now in our department - they...

expectations

I’ve long felt that many people issues are the result of a mismatch between expectations and reality.  This is especially true of policies (whether in clubs or classes).  For example, someone expects to get an A on an assignment, but they get a B instead.  Generally they become angry...

research poster presentations

It varies by field, but poster presentations are fairly common in computer science.  Many conferences typically split things into full papers with traditional presentations and short papers with poster presentations (although I recall ACL was debating making the presentation decision independent of the paper length).  The peer review process varies...

more IR

I came across an interesting IR post by Dan Lemire sometime earlier in the week that I meant to post.  He compares searching for “Kurt Gödel” with “Kurt Goedel” and in the comments “Kurt Godel”.  Google returns different results for the first two but Bing doesn’t.  The comments say “Gödel”...

hiding emails from web bots

Inevitably your email address will be somewhere on the web and a web bot will scan that webpage, extract the emails, and add them to a big list for spammers.  In response, some people spell out their email address like “trnka at udel dot com”.

faking dynamic information display

Do you ever need to display a dynamically updating value on the web?  (And not have a lot of time?)  If we’re talking about a basic daemon or long-running script, there’s a quick way to address the problem:  periodically write the output to an html file.  The problem is that...

google experiments

Sometimes you need relative frequencies of words/terms and you don’t already have a Perl script for your data.  Google experiments are a quick approximation - just search for the sequence of words in quotes and use the number of page results (at the top) as an approximate frequency.

the search for spock

I’ve been making too many systems posts and I’d like to better balance those things with more researchy things.  Also, I meant to type “truth” instead of “spock” but I couldn’t resist and it’s close enough anyway.

long-running processes

Some of my simulations take a while to run, so I have a couple of servers that I ssh into and run them on.  But there are a host of little things I didn’t know about at first:

basics of bibtex

Like most authors in my area, I use LaTeX for writing papers and BibTeX for citations in those papers.  I remember how daunting it felt when I had just started though. At someone’s suggestion, I started by looking at others’ LaTeX and BibTeX, which helped, but I also needed structured...

Santa Barbara + character set = :(

Although I’d been thinking about starting a blog at several times, an issue I had yesterday with the Santa Barbara Corpus of Spoken American English (SBCSAE) really motivated me.  In general, I have Perl scripts to simulate typing with word prediction and then when I want to compute statistical significance,...