Today we continued to look into how to store the data we will be mining. We looked at our schema and decided it would be a good idea to abstract out our “visit” entity. A visit is going to be a series of hits (page views). They will be related to each other because they’re from the same visitor. We can link them together and to a user either by using the user’s login session or with cookies if they aren’t logged in. Instead of worrying at this phase about what constitutes a visit (Is it when they haven’t returned for another hit in over a half hour? Is it that a visit ends when they close their browser window?), we can just calculate the visits later on with parameters. We can link the hits together with these parameters to present the visits given those parameters. This may for example be an SQL view instead of the results of querying an SQL table.
After learning that the client would prefer we not use Postgres and instead use MySQL or SQL Server, we decided to benchmark MySQL, specifically how quickly it can read JSON data types. It turns out modern MySQL also supports JSON as a native type. It can even do indexes on JSON columns to speed up read queries. Coupled with the fact it is also free and open source, we’re mainly looking at it at this point for our back end. Given the log data we got from the client, where we learned how many unique visitors they get per month, we ran some scenarios to find out how many queries per second our system would need to handle. We’re still well within doable range. We’re confident we’ll be able to get a system ready that’s capable of handling the queries, with room to grow for the future.