Model inference has been developed over the years for various purposes: reverse engineering existing systems, automata learning for prediction, application monitoring, etc. In particular, it has been used for regression testing of web-based applications. The idea is to reverse engineer a usage model that reflects the current usage of the application and to use it to test a new release. This is particularly useful for web-based applications, since navigation is not always determined by the links on a page. For instance, a user may access a resource directly through its URL, and the back and forward buttons found in most web browsers allow navigation that is not reflected by the links of a web page. This is why usage models have gained popularity in web-based application testing and monitoring over the past few years.
Several techniques exist to infer a model: the Pautomac competition assesses several of them every year. Unfortunately, the models used in the competition do not consider large inputs. As described by Verwer et al., the most efficient way to handle real-world models without losing too much precision is the n-gram inference technique. N-gram inference builds a usage model (i.e., a Markov chain) from a set of user sessions. A user session is a sequence of web pages visited by the user. An n-gram inference builds a usage model that predicts, for a given state, the next state based on the n-1 previous states. N-grams have been used successfully to build usage models of web-based applications in the past. As stated by Sprenkle et al., the growth of the model depends on the chosen n but also on the size of the user sessions. Indeed, if n is high and the user sessions are short, the model will not grow further for a higher value n+1. In our case, like Ghezzi et al., we choose 2-gram inference to generate our usage model.
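As an illustration of 2-gram inference, the following minimal sketch estimates a Markov chain from a set of user sessions by counting consecutive page pairs and normalising the counts into transition probabilities. The function name `infer_bigram_model`, the example URLs, and the choice to omit explicit start and end states are assumptions made for this illustration, not details of the actual inference tool.

```python
from collections import Counter, defaultdict


def infer_bigram_model(sessions):
    """Estimate a first-order Markov chain (2-gram model) from sessions.

    `sessions` is an iterable of page sequences. The result maps each
    state to a dict of successor states and their estimated probabilities.
    Hypothetical helper, written only to illustrate the technique.
    """
    counts = defaultdict(Counter)
    for session in sessions:
        # Each pair of consecutive pages is one observed transition.
        for current, following in zip(session, session[1:]):
            counts[current][following] += 1

    model = {}
    for state, successors in counts.items():
        total = sum(successors.values())
        model[state] = {
            following: count / total
            for following, count in successors.items()
        }
    return model


# Made-up sessions, for illustration only.
example_sessions = [
    ["/home", "/search", "/item"],
    ["/home", "/item"],
    ["/home", "/search", "/home"],
]
example_model = infer_bigram_model(example_sessions)
```

In this sketch, a page that never has a successor in any session (here `/item`) has no outgoing transitions and therefore does not appear as a key of the model.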
Web-based application usage models are learned from user sessions. Those sessions are usually recorded on the web server (in our case, in the Apache web log) as a list of HTTP requests received by the server. We define a user session as a sequence of HTTP requests coming from the same IP address, where each request arrives within a time-frame of x minutes after the previous one. That is, if the same IP address sends another request during the x minutes following its last request, we consider this new request part of the same user session. For our generation process, we choose a time-frame of 45 minutes.
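The session definition above can be sketched as follows. The sketch assumes the log has already been parsed into `(ip, timestamp, url)` tuples; this tuple layout, the function name `split_sessions`, and the treatment of a gap of exactly 45 minutes as still belonging to the session are assumptions made for this illustration.

```python
from datetime import datetime, timedelta


def split_sessions(requests, timeout=timedelta(minutes=45)):
    """Group requests into user sessions by IP address and inactivity gap.

    `requests` is an iterable of (ip, timestamp, url) tuples (an assumed,
    already-parsed form of the web log). A request joins the current
    session of its IP if it arrives within `timeout` of that IP's previous
    request; otherwise it starts a new session.
    """
    sessions = []
    latest = {}  # ip -> (timestamp of last request, session being built)
    for ip, timestamp, url in sorted(requests, key=lambda r: r[1]):
        previous = latest.get(ip)
        if previous is not None and timestamp - previous[0] <= timeout:
            session = previous[1]
            session.append(url)
        else:
            session = [url]
            sessions.append(session)
        latest[ip] = (timestamp, session)
    return sessions


# Made-up requests, for illustration only.
start = datetime(2024, 1, 1, 9, 0)
example_requests = [
    ("10.0.0.1", start, "/home"),
    ("10.0.0.2", start + timedelta(minutes=5), "/home"),
    ("10.0.0.1", start + timedelta(minutes=10), "/search"),
    ("10.0.0.1", start + timedelta(minutes=70), "/home"),
]
example_sessions = split_sessions(example_requests)
```

Here the last request from `10.0.0.1` arrives 60 minutes after its previous one, so it opens a new session instead of extending the first.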
To build the usage model, there are (mainly) three different ways to represent the HTTP requests, which lead to three ways to define the transitions between states:
The size of the model depends on the chosen request representation. The RRNV representation may be too fine-grained, and the generated model may then be too specific.