On the second episode of The David Brunow Podcast I talked about difficulties I’m having with my Dharma Talks app. I’d like to follow up that discussion with some examples that would be hard to share over audio only.
The problem is that I’m having trouble getting the proper information from the Mission Dharma website to display in the app. On the surface, this seems like a simple issue of taking the information that’s on the website, figuring out which parts of it mean what, and then storing it in the right places in the app. For example, if I can find an anchor tag, represented in HTML as
<a href="/talks/1">March 21, 2015: Talk Title, Talk Speaker</a>
then I can take the text inside that tag as the raw Talk data. In this example, I find the date by looking for the colon and storing everything before it. I find the title by taking everything after the colon, looking for the next comma, and storing everything between the colon and that comma. The rest of the text inside the tag is the speaker for that Talk. The URL to download the Talk is the easiest part because it is the full value of the href attribute.
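To make those steps concrete, here is a minimal sketch in Python. It is not the app's actual code; the function name `parse_talk_text` and the dictionary keys are hypothetical, and it assumes the text always follows the "Date: Title, Speaker" pattern.

```python
def parse_talk_text(text):
    """Split 'Date: Title, Speaker' into its three parts (hypothetical helper)."""
    # Everything before the first colon is the date.
    date, _, rest = text.partition(":")
    # Everything between the colon and the next comma is the title;
    # whatever is left over is the speaker.
    title, _, speaker = rest.partition(",")
    return {
        "date": date.strip(),
        "title": title.strip(),
        "speaker": speaker.strip(),
    }


talk = parse_talk_text("March 21, 2015: Talk Title, Talk Speaker")
```

Looking for the colon first matters here, because the date itself contains a comma.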
This seems pretty easy, problem solved! But not really. This is where the fragility of HTML scraping becomes a problem. That fragility is why most people recommend against HTML scraping, and I would definitely agree with them if there is any better way to solve the problem, or if the system you are building needs to be 100% reliable. But most systems don’t need that, and I believe Dharma Talks is one of them.
I’ll give an example of the fragility. Here is an anchor tag that I ran into while I was initially developing Dharma Talks:
<a href="/talks/1">March 21, 2015, Talk Title, Talk Speaker</a>
Do you see the difference? It’s subtle, but that comma after the year will completely ruin the parsing I described before. While making the first version of Dharma Talks I’d seen this pattern, so I adjusted my parsing for it. First I look for a colon, and if there is one I do the parsing I described before. If there isn’t a colon, I find the second comma in the text inside the anchor tag and treat it the same way I treated the colon earlier.
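The colon-first, second-comma-fallback logic might look roughly like this in Python. This is a sketch under assumptions, not the app's real code: `split_date` and `parse_talk_text` are hypothetical names, and the choice to raise `ValueError` for text that fits neither pattern is mine.

```python
def split_date(text):
    """Return (date, rest), using the colon if present, else the second comma."""
    if ":" in text:
        date, _, rest = text.partition(":")
        return date.strip(), rest.strip()
    # No colon: the first comma sits inside the date ("March 21, 2015"),
    # so the second comma is the one that ends the date.
    first = text.find(",")
    second = text.find(",", first + 1)
    if first == -1 or second == -1:
        raise ValueError("text matches neither known pattern: " + repr(text))
    return text[:second].strip(), text[second + 1:].strip()


def parse_talk_text(text):
    """Parse either 'Date: Title, Speaker' or 'Date, Title, Speaker'."""
    date, rest = split_date(text)
    title, _, speaker = rest.partition(",")
    return {"date": date, "title": title.strip(), "speaker": speaker.strip()}


with_colon = parse_talk_text("March 21, 2015: Talk Title, Talk Speaker")
with_comma = parse_talk_text("March 21, 2015, Talk Title, Talk Speaker")
```

Both inputs come out as the same date, title, and speaker, which is the whole point of the fallback.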
This worked for over a year, apart from a couple of minor issues where the person who maintains Mission Dharma’s website changed the text inside an anchor tag after the app had already parsed it, so the same episode appeared twice in the app. At that time I hadn’t wanted to enforce any sort of uniqueness because I wanted the app to parse things as generally as possible; I feared any restrictions would make it more fragile.
At that time the app relied only on the Mission Dharma website being available to get new episodes. All the parsing was done inside the app and I had no control over what episodes were showing on someone’s phone. I couldn’t remove a duplicate episode that had had its title changed. I couldn’t fix a parsing error if the person maintaining Mission Dharma’s website didn’t follow one of the two patterns I described above. I liked the simplicity of only needing the app and the website, but I hated how fragile the whole thing was. I hated that my name was on something that looked poorly made. The line between simple and shoddy is thin.
Life got in the way and I didn’t work on the app for a year. When I did get time I noticed that I had taken a shortcut in building the app and had hard-coded the years to be parsed. By hard-coding, I mean that I had told the app to parse 2010, 2011, 2012, 2013, and 2014. You can probably see the problem there. Hard-coding is bad and I know that, but I’m guessing I saw it as a very short-term solution, so I let it slide. That’s a good lesson for the future.
So I fixed the hard-coding: the app now looks at the current year and parses from that year all the way back to 2010, which should make that part of the app future-proof. I submitted the update to Apple, it was approved, and things worked. For about a week.
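As an illustration of that fix, here is a small Python sketch that builds the list of years from the current date instead of a hard-coded list. The names `FIRST_YEAR` and `years_to_parse` are hypothetical; only the 2010 starting point comes from the app.

```python
from datetime import date

FIRST_YEAR = 2010  # earliest year of Talks to parse


def years_to_parse(today=None):
    """Return the years to parse, newest first, from this year back to 2010."""
    if today is None:
        today = date.today()
    return list(range(today.year, FIRST_YEAR - 1, -1))


years = years_to_parse(date(2015, 3, 21))
```

Passing the date in as a parameter is only there to make the function easy to check; calling `years_to_parse()` with no argument uses today's date.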
Up above when I was describing the ways the parsing could break I only talked about the text inside the anchor tag. That’s the nice way things can break. The bad way is if there is no anchor tag. For example, a simple typo could make this:
<a href="/talks/1"March 21, 2015: Talk Title, Talk Speaker</a>
Do you see the problem? The app would see it instantly and the parsing would break immediately. The opening anchor tag doesn’t have its closing ‘>’ so it isn’t truly an anchor tag. This is what happened about a week after I released the new version of the app. New Talks stopped being added to the app.
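To show why one missing character is enough, here is a hedged Python sketch of a pattern-based link finder. The regular expression is my assumption about what such a scraper might use, not the app's actual pattern, and `find_talk_links` is a hypothetical name.

```python
import re

# Assumed pattern: an opening <a href="...">, the link text, then </a>.
ANCHOR = re.compile(r'<a\s+href="([^"]*)"\s*>(.*?)</a>')


def find_talk_links(html):
    """Return a list of (url, text) pairs for every well-formed anchor tag."""
    return ANCHOR.findall(html)


good = find_talk_links('<a href="/talks/1">March 21, 2015: Talk Title, Talk Speaker</a>')
bad = find_talk_links('<a href="/talks/1"March 21, 2015: Talk Title, Talk Speaker</a>')
```

The well-formed tag yields one (url, text) pair, while the tag with the missing ‘>’ yields nothing at all, so the Talk silently never shows up.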
For a while I’d been thinking about having an application on my server that does the parsing and stores the results in a database. That would give me the control I needed to make sure that the Talks were loaded in the app correctly. Since the database would be on my server, I could change any entries that got parsed incorrectly. If things were too broken on the Mission Dharma site I could manually add Talks to the database. I could finally guarantee uniqueness for each Talk because I would have total control of the data in the database. Ultimately, it meant that I could make sure that the information for each Talk was correct.
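One way a database can enforce that uniqueness is sketched below, using Python's built-in sqlite3 module. Everything here is an assumption for illustration: the table layout, the function names, and especially the choice to use the download URL as the unique key (on the theory that a Talk's URL stays the same even when its link text is edited). The real server application may work quite differently.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) talks table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS talks ("
        "url TEXT PRIMARY KEY, date TEXT, title TEXT, speaker TEXT)"
    )
    return conn


def save_talk(conn, url, date, title, speaker):
    """Insert a Talk, or overwrite the existing row that has the same URL."""
    conn.execute(
        "INSERT OR REPLACE INTO talks (url, date, title, speaker) "
        "VALUES (?, ?, ?, ?)",
        (url, date, title, speaker),
    )
    conn.commit()


conn = open_db()
save_talk(conn, "/talks/1", "March 21, 2015", "Talk Title", "Talk Speaker")
# The website text changes later; the same URL is parsed again.
save_talk(conn, "/talks/1", "March 21, 2015", "Renamed Talk Title", "Talk Speaker")
rows = conn.execute("SELECT url, title FROM talks").fetchall()
```

After both saves there is still only one row for `/talks/1`, carrying the newer title, rather than two copies of the same episode.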
Why didn’t I do this a year ago when I first released the app? A few reasons. I wanted the app to be able to live on without me and without my server. I wanted the components to be simple. And I didn’t have experience with any programming language that I could run on my server. All my work had been on Windows servers and I had a Linux server. I still think these are valid reasons to make the decision I made, but if I had it to do over again I would have done it differently. I would have implemented the solution that I just did – a server application that contacts the Mission Dharma website for Talks to parse and an iOS app that gets the Talk information from that server application.
That solution went live today.