WikiJournal Preprints/Algorithms for Categorical-Generative Analysis: Implementing an Inductive, Comparative Method for Social Processes based on Formal Language Theory

Introduction
The concept of structure lies at the foundations of science, especially of the social sciences. Because of the difficulty of conducting controlled experiments to generate empirical data, social research design usually relies upon theoretically guided models and deductive methods of analysis, depending on whether the data are quantitative or qualitative. On the one hand, quantitative researchers rely upon the objectivist assumption of directly observable phenomena, but they still need to adopt a statistical model compatible with the distributional structure underlying the empirical data of a random sample, so that accurate predictions become possible. On the other hand, qualitative researchers reject objectivism because of the difficulty of measurement and the dynamics of change in unobservable phenomena, even though they often need to assume a theoretical framework compatible with the generative structure underlying the evidence manifested in selected instances, so that plausible explanations become possible. Of course, there is even more diversity among the methods of social analysis than this dichotomy suggests, but it is the starting point of the most popular methodological typology. In the social sciences, researchers rely upon four major scientific methods: the statistical, the experimental, the case study, and the comparative. They make structural assumptions to test theoretical relations between either concepts or variables regarded as important to explain the phenomenon. The examination of assumptions about structure is often absent from the research protocols of empiricist studies. Nonetheless, this question becomes fundamental to reliability when studying complex, dynamic and contingent phenomena: what is under inquiry is not an empirical data set but the structural mechanisms that generated it.
Furthermore, among these four research approaches, the comparative method is the one for which this issue is most critical to reliability, because each empirical setting has its own specific social structure.

The comparative method was originally developed for the study of the historical evolution of languages. It consists of the analytical comparison of both common and discordant features between distinct languages from the same linguistic group. In other social sciences, the comparative method is a historical and empirical approach to the comparison of distinct instances from the same class of social phenomenon: they share the same general structure but diverge on their evolutionary path because of the specificities (i.e., the context) found in each empirical setting. For instance, organizations, institutions, political regimes and countries are objects of inquiry commonly tackled using the comparative approach.

The comparative method shares some features with the case study method when both emphasize the context of an instance of the phenomenon. Its implementation often resembles a multiple case study due to the systematic selection of pairs of cases. The comparative method can still rely on analytical techniques based on formal models, like the statistical and the experimental methods, even though the choice of the comparative method often takes place when neither the statistical nor the experimental method is applicable due to spatial and/or temporal heterogeneity in the data.

Qualitative Comparative Analysis (QCA) is a technique to recognize configurations of values for a set of categorical variables using empirical data. In the social sciences, particularly in Sociology and Political Science, research based on QCA is also known as configurational analysis, in contrast with statistical analysis. The latter approach is still “comparative” in the way it contrasts many entities, but there are key differences. On the one hand, statistical analysis relies on Probability Theory to assess the net effects of common features of a population-wide phenomenon: it implements a variance-based analytical method over randomly sampled ordinary instances. On the other hand, configurational analysis relies on Set Theory to assess the powers and liabilities of both common and discordant features of a context-specific phenomenon: it implements a membership-based analytical method over systematically selected representative cases. Today, configurational analysis is nearly synonymous with the comparative method, and QCA is a tool implementing it, although the term “comparative method” originally refers to a historical research approach that predates QCA.

The goal of QCA’s configurational analysis is to test a deterministic hypothesis about the membership relation between categorical attributes under the assumption of error-free measures. Critics claim that the situations in which QCA is applicable are rare because of a triad of assumptions that are naïve simplifications regarding the complexity of the objects of study – particularly when working with a collection of cases. First, the absence of measurement errors, which is an implausible assumption. Second, the inductive approach to theory formulation from raw data – except in experimental studies, theory is indispensable to build a model supporting data collection and analysis. Third, the resulting deterministic membership relation between categorical attributes (i.e., finite sets of discrete elements rather than dimensions of a numerical function) is a context-specific explanation – there are no net effects between variables. Indeed, the use of a configurational analysis technique as a substitute for a statistical analysis technique is a huge mistake that many researchers using QCA have made under the “small-N sample” rhetoric: a minimum sample size is always necessary to provide valid generalizations to a population.

Fortunately, the situations in which a set-theoretic method is applicable are not rare in the social sciences: whenever an empirical data set results from the same concrete social structure, that is, a specific instance of social system, QCA is quite applicable. In this sense, learning patterns of behavior that are contingent to a specific empirical setting is the common goal of research works using the comparative approach in many fields of the social sciences. At first, the worth of applying QCA is a matter of understanding both the inference algorithm and the assumptions about the structure underlying the phenomenon under inquiry. After all, the right tool is the one applied to the right problem whatever the type of the method in use.

This paper argues that the polemic around QCA is actually a matter of the epistemology under consideration rather than an ordinary mistake. It proposes to consider QCA as a technique for an alternative type of classification analysis, which relies on a different philosophy of how to produce scientific knowledge in substitution for the mainstream epistemological paradigm, Social Positivism. Known as a post-positivist paradigm, Critical Realism does not deny Social Positivism, but it extends the mainstream set of assumptions and expands what is in the scope of scientific inquiry. It is necessary to replace the assumption of epistemic objectivism and the falsification logic of Social Positivism, which are often taken for granted by the majority of social researchers, with the assumption of epistemic relativism and the retroduction logic of Critical Realism. This paradigm shift helps scholars to reconcile a widely acknowledged theoretical proposition about the phenomenon under inquiry with a surprising or anomalous fact, instead of either rejecting the former or ignoring the latter.

In addition, the present paper introduces a retroductive analytical procedure that is equivalent to a classification algorithm much like those in use in the practice of normal science, but relying on configurational rather than statistical reasoning. Firstly, this work presents the classification problem through the two main approaches to solving it: one relying upon Information Theory, described in the form of the decision-tree learning model, and the other relying on Probability Theory, described in the form of the logistic regression model. Next, the paper suggests that configurational analysis is a kind of classification analysis based on algorithms working with combinational logic instead of probabilities. The first configurational classification algorithm ever used performs the simplification of a logic function relying upon Set Theory – it was used in the original QCA tool. Another algorithm may be regarded as an extension of the first, but relying upon Formal Language Theory instead – in its turn, an extension of Set Theory. The application of this second algorithm to learn a path-dependent, sequential logic from empirical data, providing a new configurational approach to the classification problem, is the main contribution of this paper.

The Classification Problem
Consider a collection of instances of a class described by a fixed set of attributes, in which the values for one of these attributes, named class, depend on the values for all the other attributes together. The class attribute delimits mutually exclusive partitions of the instance space. Consequently, different configurations of the values of the other attributes can determine the same class, that is, be associated with the same value of the class attribute. The procedure for recognizing patterns in the values of the attributes discloses the partitioning of the instance space into the given set of classes: it solves a supervised learning problem in data science known as the classification problem. Any algorithm solving this problem, known as a classifier, may employ different data analysis techniques to produce a classification model, guessing a type of concrete structure that generated the data under analysis. The resulting classification model can predict the class of an instance relying only on the values of its descriptive attributes. This work discusses the two main groups of classifiers in order to introduce a new group of classifiers that relies on a different philosophy of how to produce scientific knowledge - under the assumptions of the epistemological paradigm known as Critical Realism in substitution for those of the mainstream epistemological paradigm, Social Positivism.

The next sections present the two main classification algorithms to enable readers to compare them with the two algorithms suggested to implement the proposed third approach to the classification problem. In the first section, classifiers based on Information Theory implement the deterministic partitioning of the instance space. Next, classifiers based on Probability Theory calculate the chance of a given instance fitting each class instead of predicting a single one. Finally, two classifiers based on Set Theory and Formal Language Theory make the deterministic partitioning of an instance space relying upon the relations of necessity and sufficiency between the attributes of representative instances, rather than relying on the relation of covariance between the attributes of sample instances like the other classifiers do.

Classification based on Information Theory
The simplest classifier of this group, the ID3 algorithm can build up a classification structure such as a decision tree from the attribute data using a recursive, “greedy” approach. After selecting the most important descriptive attribute, ID3 partitions the instance space for each possible outcome. Then, it recursively repeats this operation for each partition until there are no attributes left unselected. The criterion for attribute selection is based upon the concept of entropy from Claude Shannon's Information Theory: a measure of the uncertainty or disorder involved in the value of a random variable or the outcome of a stochastic process. Consider the minimization of the entropy H(S) in each partition S, given by H(S) = $$\sum _{c\in C}-p\left(c\right).{\log }_{2}p\left(c\right)$$: it gradually reduces the amount of heterogeneity until all instances in a partition belong to the same class (c $$\in$$ C). Alternatively, consider the maximization of the information gain IG(S, A), a measure of the difference in entropy from before to after the partition S is split again on an attribute A, given by IG(S, A) = H(S) – H(S|A), where the conditional entropy is given by H(S|A) = $$\sum _{c\in C,a\in A}-p\left(c,a\right).{\log }_{2}\frac{p\left(c,a\right)}{p\left(a\right)}$$.
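As a minimal sketch of this selection criterion, the entropy and information-gain formulas above can be computed directly; the function names and toy data below are illustrative, not part of ID3 itself:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = sum over classes c of -p(c) * log2 p(c)."""
    n = len(labels)
    return sum(-(k / n) * log2(k / n) for k in Counter(labels).values())

def information_gain(values, labels):
    """IG(S, A) = H(S) - H(S|A) for a single attribute's observed values."""
    n = len(labels)
    h_cond = 0.0
    for a in set(values):
        subset = [c for v, c in zip(values, labels) if v == a]
        h_cond += (len(subset) / n) * entropy(subset)
    return entropy(labels) - h_cond

# Hypothetical toy data: an attribute that separates the classes perfectly
labels = ["yes", "yes", "no", "no"]
attr = ["a", "a", "b", "b"]
print(entropy(labels))                 # 1.0 for a 50/50 split
print(information_gain(attr, labels))  # 1.0: the split removes all uncertainty
```

An attribute with the highest information gain is the one ID3 would select at the current node.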

An extension of the ID3 algorithm, the C4.5 algorithm can build up a classification structure, but in the form of a ruleset; each path from the root node to a leaf is turned into a rule whose conditions are the possible values for each attribute along the nodes in the path. Although there are other technical improvements in relation to ID3, this is the main difference between these algorithms.

All classification algorithms based upon the concept of information entropy produce the deterministic partitioning of the instance space into n-dimensional hyper-rectangles, where n is the number of attributes in addition to the class attribute. There can be as many partitions as the level of disorder of the instance space imposes, which can turn into a problem of over-fitting. Some pruning techniques are applicable to avoid this problem, but it may also be the case for using another type of classification algorithm.

Classification based on Probability Theory
Another type of classification algorithm relies on binary logistic regression. Consider the logistic function f(x) = $$\frac{L}{1+{e}^{-k\left(x-{x}_{0}\right)}}$$, which is a sigmoid function, where L is the maximum value of the curve, k is the logistic growth rate, and x0 is the midpoint, such that the curve has an “S” shape around the midpoint. The standard logistic function is f(x) = $$\frac{1}{1+{e}^{-x}}=\frac{{e}^{x}}{{e}^{x}+1}$$, that is, a logistic function with L=1, k=1, and x0=0.

In statistics, the logistic or logit model is the regression model that uses a logistic function (or any other sigmoid function) to model a dichotomous or binary dependent variable, while the independent variables can assume any real value. In this sense, logistic regression analysis is the procedure to estimate the β-coefficients of a logistic model given by the equation P(Y=1 | X) = $$\frac{{e}^{{\beta }_{0}+{\beta }_{1}X}}{{e}^{{\beta }_{0}+{\beta }_{1}X}+1}$$ $$\therefore$$ $$\frac{P\left(Y=1\mid X\right)}{1-P\left(Y=1\mid X\right)}={e}^{{\beta }_{0}+{\beta }_{1}X}$$ $$\therefore$$ $$\ln \left(\frac{P\left(Y=1\mid X\right)}{1-P\left(Y=1\mid X\right)}\right)={\beta }_{0}+{\beta }_{1}X$$ using the maximum likelihood estimator. It applies the maximum-likelihood ratio to assess the statistical significance of the variables that take part in the logistic regression equation. An extension of this model with a categorical dependent variable is multinomial logistic regression.
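The three equivalent forms of the model above can be checked numerically; the coefficients below are hypothetical, not estimates from real data:

```python
from math import exp, log

def logistic_p(b0, b1, x):
    """P(Y=1 | X=x) for a binary logistic model with given coefficients."""
    z = b0 + b1 * x                      # the linear predictor (log-odds)
    return exp(z) / (exp(z) + 1)

# Hypothetical fitted coefficients and one observation
b0, b1, x = -1.0, 2.0, 1.5
p = logistic_p(b0, b1, x)
odds = p / (1 - p)
# The two identities derived above: odds = e^(b0 + b1*x), ln(odds) = b0 + b1*x
print(odds, exp(b0 + b1 * x))
print(log(odds), b0 + b1 * x)
```

Estimating β0 and β1 from data is the job of a maximum-likelihood routine; the sketch only shows how probability, odds, and log-odds relate once coefficients are known.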

Why use logistic regression instead of linear regression to perform classification analysis? First, linear regression relies on a numerical dependent variable, while logistic regression uses a categorical dependent variable. There exists no ordering relation between the values of the mutually exclusive classes, but there is order among the elements of a numerical codomain. Reordering the list of classes in a linear regression model would result in a different relation each time. As classification analysis needs a categorical dependent variable, it would be inconceivable to use the linear regression model, except in the dichotomous case, in which order does not matter anymore because there are only two possible values in the codomain. There is an important consequence of adopting each of the classification approaches. On the one hand, using an entropy-based classification method, the partitioning of the instance space is into hyper-rectangles, implying one linear boundary for each attribute, perpendicular to its axis. On the other hand, a logistic classification makes one smooth linear boundary between the instances of distinct classes, which can be of any direction. Thus, the choice of which classification model to use relies on the disposition of the instances of the classes in the space under analysis. Logistic classification works better if there may be a single functional decision boundary between the groups of instances of distinct classes in a high-dimensional space. The classifiers using entropy work better if there is a dispositional pattern between the groups of instances necessitating multiple decision rules with a small number of attributes (fewer than fifteen); nevertheless, having far too many degrees of freedom also makes them more prone to over-fitting.

The logistic regression model outputs deterministic relations too. If the estimation of the coefficients of a regression equation reaches high confidence, then prediction is highly accurate, which is only possible in truly deterministic systems. Nevertheless, regression analysis can assess unsystematic errors of measurement, which improves the confidence of the estimation. This is a critical flaw in the entropic algorithms, which can only assess errors of misclassification.

Classification based on Set Theory
A configuration is any arrangement in which the attributes of the whole depend on the relative disposition of every constituent part; that is, the arrangement of things put together explains the observed result as a whole. The relationship between the whole and the structure of its component parts is comprehensible using a systemic perspective: the configuration of interrelationships between constituent entities yields an emergent form such that the properties of the whole are not a direct result of the properties of the parts. This is a key idea not only in fields of hard science, such as algebra and chemistry, but in fields of social science too. In addition, scientists are also concerned with configurations evolving over time: patterns of individual or collective behavior observed when studying a single social system or multiple social systems with the same basic structure. These patterns emerge over a historical period of observation, and even across distinct empirical settings in the case of multiple systems. QCA proponents claim this technique to be a configurational analysis tool, which unveils functionally equivalent causal paths to the outcome of interest in a supposedly complex phenomenon. However, QCA lacks the features to acknowledge complex, dynamic and contingent patterns of sequences of events that take place over space and time. The rationality behind QCA depends on combinational logic only. To acknowledge patterns in chains of events, a sequential logic is necessary – the one implemented by a finite automaton, which recognizes regular languages, the first level of algorithmic complexity just above combinational logic. Even though both logics are compatible – sequential logic is a kind of extension of combinational logic –, there are still higher levels of computational complexity above the regular languages that are associated with complex, dynamic and contingent social phenomena.
In addition, there is also the need to adopt an analytical procedure beyond the a-historical configurational analysis of QCA without leaving behind the accumulated expertise of QCA researchers. There are alternatives for adding temporality to QCA, but they are still constrained to the sequential logic of regular languages.

This paper develops a new methodology of classification analysis that solves this challenge while still using csQCA. Contrary to the hypothetical-deductive approach to inquiry, csQCA manipulates a set of representative instances of a class of social form rather than ordinary samples from the population of its instances. In the next section, the algorithm used in csQCA is presented and defined as a classifier based on Set Theory. In this sense, the present paper suggests an extension to this method that raises the classification problem to even higher levels of computational complexity.

Rule Induction using the Quine-McCluskey Algorithm
Consider the logic function f(t1, …, tn), also called a binary or Boolean function, as any function of the form f: B^n → B defined on an n-dimensional domain B^n of n logic terms, each representable as either {0, 1} or {true, false}. Consider also the problem of finding its corresponding formula as a sum of products of terms in the Canonical Disjunctive Normal Form (DNF), known as the Minterm Canonical Form. Given this logic function f as a truth table, there are 2^n possible combinations of each term or its negation, known as minterms. An implicant of this function f is any minterm mi such that if its value is “true,” then the value of the function f is also “true;” that is, mi logically implies f. Any representation of the logic function f by a formula uses the sum of minterms that are implicants. If at least one of the implicants is logically equivalent to a combination of other implicants, then it is not the representation with the minimum number of minterms. In this circumstance, there is a need to minimize the number of input terms in the sum of minterms. For the simplification of the formula of a logic function, it is possible to use algebraic manipulation based on the reduction principle, given by A.B + ~A.B = (A + ~A).B = B, combining pairs of implicants into a new implicant and thereby reducing the number of minterms. Whenever no other combination of implicants is possible, the resulting formula consists of the sum of all prime implicants of the logic function. If every minterm covered by a prime implicant is also covered by other prime implicants, then it is not an essential prime implicant; eliminating it from the formula of the function is necessary. The simplest formula must be a combination of essential prime implicants only.
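The reduction principle can be verified exhaustively over the truth table; the snippet below is an illustrative check of the identity A.B + ~A.B = B, not part of any minimization algorithm:

```python
from itertools import product

def f(A, B):
    """The left-hand side of the reduction principle: A.B + ~A.B."""
    return (A and B) or ((not A) and B)

# Over every row of the 2-variable truth table, the formula collapses to B
for A, B in product([False, True], repeat=2):
    assert f(A, B) == B
print("A.B + ~A.B = B holds on all four rows")
```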

The Quine-McCluskey (QMC) algorithm was the first systematic procedure to apply the Method of Prime Implicants to the minimization of logic functions. Given a logic function f in the form of a truth table, it finds all implicants of f. Next, it recursively uses the reduction principle to transform pairs of implicants with only one divergent term (i.e., B versus ~B) into a single implicant until only prime implicants are left. Finally, the additional elimination of redundant prime implicants reduces their number to the essential prime implicants, resulting in the simplest formula for the logic function f.
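A minimal sketch of the reduction phase of QMC might look as follows; implicants are encoded as strings over '0', '1' and '-' (don't-care), and the final selection of essential prime implicants is omitted for brevity:

```python
from itertools import combinations

def combine(a, b):
    """Merge two implicants differing in exactly one fixed position."""
    diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diff) == 1 and a[diff[0]] != '-' and b[diff[0]] != '-':
        i = diff[0]
        return a[:i] + '-' + a[i + 1:]
    return None

def prime_implicants(minterms):
    """Apply the reduction principle until no pair of implicants combines."""
    current, primes = set(minterms), set()
    while current:
        used, nxt = set(), set()
        for a, b in combinations(sorted(current), 2):
            m = combine(a, b)
            if m:
                nxt.add(m)
                used.update({a, b})
        primes |= current - used   # implicants that combined no further
        current = nxt
    return primes

# f(A,B,C) is true on minterms 011, 111, 110
print(sorted(prime_implicants(['011', '111', '110'])))  # ['-11', '11-'], i.e. B.C + A.B
```

Both resulting prime implicants are essential here, since each covers a minterm the other does not.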

Usually, engineers apply QMC to the design of digital circuits, but the proponents of configurational analysis apply it to empirical social research. The csQCA calculates all combinations of attributes leading to each outcome of a type of class event as essential prime implicants of the logic function underlying the empirical data. There are versions for nominal and ordinal variables, known as the multi-value QCA approach (mvQCA). Tosmana is a tool that implements this categorical variable-based approach, generalizing the dichotomous variable-based one.

The QMC algorithm builds a logic function model on a domain set of dichotomous variables. In csQCA, the assessment of the resulting model relies on two fitting criteria: the measure of coverage provides the percentage of all observations having a specific configuration of conditions; and the measure of consistency provides the percentage of observations with a specific configuration of conditions that have the same event outcome.
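Following the definitions above, both fitting measures can be sketched over a list of observations; the truth-table rows below are hypothetical:

```python
def coverage(rows, config):
    """Share of all observations exhibiting this configuration of conditions."""
    return sum(1 for cond, _ in rows if cond == config) / len(rows)

def consistency(rows, config, outcome):
    """Share of observations with this configuration showing the given outcome."""
    matching = [out for cond, out in rows if cond == config]
    return sum(1 for out in matching if out == outcome) / len(matching)

# Hypothetical observations: (configuration of conditions, event outcome)
rows = [(('1', '0'), 1), (('1', '0'), 1), (('1', '0'), 0), (('0', '1'), 1)]
print(coverage(rows, ('1', '0')))          # 0.75: three of four observations
print(consistency(rows, ('1', '0'), 1))    # 2/3: a contradictory configuration
```

A consistency below one, as in the example, signals a contradictory line in the truth table.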

Once a truth table is built from empirical data, crisp-set QCA calculates a measure of coverage for each configuration of conditions; it determines which of them is dominant (or exceptional) in comparison with the other alternative implicants of the same event outcome. The csQCA tool also calculates a measure of consistency for each configuration of attribute conditions empirically observed. Perfect consistency refers to the situation in which all instances of a configuration of conditions result in the same event outcome. Otherwise, there are contradictory lines in the truth table; that is, the same configuration of conditions is associated with different event outcomes, which yields a measure of consistency between zero and one. This is evidence of the current inability of the model to explain the phenomenon perfectly.

QCA proponents handle contradictory lines by adding or removing one condition of a configuration, in a way that resembles an abductive inference. The goal is to explain the anomalous fact – one of the outcomes of the class event. A third solution is to accept the contradictory rules based on either the theoretical or the substantive knowledge of the scientist; however, this is no different from acknowledging a specific (but unknown) context. After eliminating or accepting all contradictions, the next step in this procedure of configurational analysis is to simplify the conjunctions of attribute conditions using the Method of Prime Implicants.

For this application, QMC can test the hypothesis of a deterministic set-theoretic relationship between the input terms and the output of interest. It is only valid for the set of instances under inquiry because there is no assessment of the appropriateness of the sample size for statistical inference. QMC builds up a classification model based on a specific pattern for the values of the attribute conditions from a set of representative instances. Selecting instances in a systematic way instead of randomly sampling them is necessary to avoid bias, reflecting the variety of configurations of conditions leading to the same event outcome; however, it does not reflect the frequency of these configurations in the population. In other words, the number of times each configuration of conditions occurs in the selection of instances does not matter in QCA. In addition, statistical generalization of the results to the whole population is not possible; only analytical generalization to the theory is valid. The number of instances is irrelevant to the reliability of the conclusions: using the QMC algorithm like a set-theoretic classifier, it is possible to directly manipulate the relations between theoretical concepts rather than indirectly manipulating them by processing sample data. The membership relation between elements of the sets of event outcomes is under analysis.

Classification based on Formal Language Theory
Historical analysis of configurations relies on the selection of instances of a category of behavior, that is, the structure of membership relations between types of events that occur over time in sequences. First, all instances of the same type of event share a common definition: each can be described by the observed values for some fixed, finite set of attributes. After that, clustering the instance space produces groups of instances presenting distinct patterns of values for their attributes, related to mutually exclusive outcomes of this type of event. Finally, sequences of event outcomes alternate in a systematic way such that one instance relates to some of the others according to a complex but deterministic pattern of co-occurrence of these event outcomes. The computer-based implementation of this historical approach to configurational analysis is possible relying on Formal Language Theory, the study of sets of sentences known as formal languages. Any sentence is a finite sequence of discrete symbols, also known as a string, resulting from the concatenation function (•) defined over some fixed, finite set of symbols, called the alphabet (Σ). A formal language L is a subset of the free monoid (Σ*, •) over the alphabet Σ, whose concatenation function • : Σ* × Σ* → Σ* generates any string of L from a pair of substrings, including the empty string. In this sense, the grammar G = (N, Σ, P, S) is a formal model where: N is the finite set of system states through which the abstract machine can transit along the path of derivation of a string belonging to the language L(G); Σ is the alphabet set; P is the ruleset; and S is the initial system state.
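As an illustration of the formal model G = (N, Σ, P, S), a small hypothetical grammar and a leftmost derivation routine might be sketched as follows; the rules and the derived string are invented for the example:

```python
# A grammar G = (N, SIGMA, P, START) represented as plain data
N = {'S', 'A'}                         # system states (nonterminals)
SIGMA = {'a', 'b'}                     # the alphabet of terminal symbols
P = {'S': ['aA', 'b'], 'A': ['Sb']}    # hypothetical context-free ruleset
START = 'S'                            # initial system state

def derive(choices):
    """Rewrite the leftmost nonterminal using the chosen production index."""
    s = START
    for i in choices:
        pos = next(j for j, c in enumerate(s) if c in N)
        s = s[:pos] + P[s[pos]][i] + s[pos + 1:]
    return s

# Derivation path: S -> aA -> aSb -> abb
print(derive([0, 0, 1]))  # abb
```

Each derivation path through the states in N corresponds to one string of the language L(G).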

It is possible to understand Formal Language Theory as an extension of Set Theory, one that introduces a hierarchy of levels of complexity for the concatenation function called the Chomsky Hierarchy. This hierarchy of languages is equivalent to both the hierarchy of abstract machines from Automata Theory and the hierarchy of grammars from Generative Grammar Theory – any sequence of symbols is the result of a rule-based computation using a processing system that relies upon one of the possible levels of algorithmic complexity. In this sense, there is the need to adopt another mathematical framework to support the analysis of the classes of models: Category Theory. The conciliation of Category Theory and Formal Language Theory in the form of a meta-theoretical framework makes any category of social process equivalent to a category of formal language.

The instances of the same category of social process may differ between them in terms of a surprising or anomalous fact that becomes a new outcome for a type of event. This divergence introduces ambiguity in the abstract structure of this category of social process because there are two possible explanations for the same sequence of outcomes. Consider solving this empirical divergence within the grammar using the logic of retroduction rather than the logic of falsification. This procedure consists of modifying the ambiguous rules to become mildly context-sensitive rules; this modification extends the initial category, turning it into a new subcategory. The resulting grammar increases in complexity to the level of an indexed grammar or another mildly context-sensitive grammar class; usually, this level of computational complexity is enough to describe configurations of ordered chains of past event outcomes.

Categorical-Generative Analysis adopts an analytical method that searches for the grammar with the highest algorithmic complexity fitting the sequence of event outcomes under inquiry. If each event outcome relied on the prior event outcome only, then the solution to this classification problem could be a regular grammar, which is "memoryless" – it cannot "plan ahead" its future occurrences. However, this is a strong constraint on the nature of social processes. Categorical-Generative Analysis assumes a mildly context-sensitive grammar as the plausible algorithmic complexity level for social processes because recursion is necessary to grasp path dependence, and an ordered chain of past event outcomes is adequate to grasp contingency. There is an ontological reason why context-sensitive languages are more computationally complex than the processes in the social realm, which is explored further in this paper. Here, it is enough to understand the relationship between the context of past event outcomes and the so-called surprising or anomalous fact.

Context-free Grammar Induction using the Nevill-Manning Algorithm
The language-learning problem is that of acquiring the syntax, or syntactic structure, through the set of grammar rules for generating valid sentences of a language. The approach of generative linguistics to the study of similarities and differences between languages highlights common and divergent features in a family of languages. Here, there are some important questions: the possibility of learning a language from a set of valid sentences; the role of previous knowledge; the types of inputs available to the learner (either positive or negative); and the effect of semantic information on learning the syntax of a language. The key point here is the distinction between syntax and semantics in the study of language from the perspective of Generative Linguistics. In this sense, there are algorithms to learn a set of syntactical (or grammatical) rules for generating sequences of symbols belonging to the same language (or its grammar), which creates a syntactic pattern classification system. This kind of technology applies to speech recognition and the sequencing of chromosomes. Grammar inference (or induction) is a kind of algorithmic learning procedure for an unknown, target grammar from a set of sentences labeled as belonging or not to a given language. This systematic approach to the problem of syntax acquisition requires the selection of the appropriate class of formal language (or class of formal grammar). This is the basic model for selection-based classification, much like logistic regression (or the logit model) and C4.5 (or the decision-tree model) are the basic models for sample-based classification. The algorithm of grammar induction yields an approximation to an unknown, target grammar from a set of candidate grammars defined over the same alphabet set. It may rely upon different types of information, such as positive and/or negative examples and the availability of domain-specific previous knowledge.
For instance, it has already been demonstrated that regular grammars cannot be learned from positive examples alone (Gold's theorem); any research design of this kind must take this limitation into account.

There exist grammar induction algorithms that build up the rule set P for a given string under the assumption of a specific level of computational complexity. Comparing the resulting rule sets for a pair of strings determines one of three possibilities: (1) they fit into the same category of language; (2) they fit into subcategories sharing a common generative structure; or (3) they do not relate to each other. In this sense, the problem solved by Categorical-Generative Analysis resembles the so-called Configuration Analysis using Qualitative Comparative Analysis; nevertheless, its formal model relies on a level of computational complexity greater than that of combinational logic.

The Nevill-Manning algorithm, also known as Sequitur, builds a hierarchical structure by recursively replacing frequent substrings with a new symbol, which is added to the set of system states (N) and placed at the head of a rule producing that substring. Sequitur is a greedy algorithm that minimizes the size of the representation of the string with no information loss. The resulting tree-like structure encoding its solution is also a rule set that regenerates the initial form of the string whenever necessary. Sequitur is a data-compression algorithm in that it reduces the entropy of the sentence under analysis by transforming the observed regularities into production rules; nonetheless, it can also be regarded as a classification algorithm for sequences of symbols belonging to the same language.

The algorithm reads the symbols of the sentence serially while storing them in an initial grammar rule of the form S → γ, where S $$\in$$ N and γ $$\in$$ (N $$\cup$$ Σ)*. Every time a bigram (ai, aj) $$\in$$ (N $$\cup$$ Σ)2 occurs twice in the substring γ, a new production rule in Chomsky Normal Form, Ak → ai aj, is added to the rule set. Both occurrences of the bigram are also replaced by the new symbol Ak $$\in$$ N in the initial production rule, which becomes S → αAkβAkω. Finally, this optimization algorithm checks whether the rule set under construction satisfies a pair of constraints: (a) digram uniqueness, by which every bigram occurs at most once in the modified S → γ; and (b) rule utility, by which every symbol A $$\in$$ N occurs at least twice in the bodies of the rules. The algorithm performs this routine until no symbol is left unread.
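The replacement loop can be sketched as a simplified, offline variant of this idea (closer in spirit to the Re-Pair compressor than to the incremental, online Sequitur): it repeatedly replaces the most frequent repeated bigram with a fresh non-terminal and records the corresponding rule. Function and symbol names are illustrative, and the rule-utility constraint is omitted for brevity.

```python
def induce_grammar(sentence):
    """Greedy grammar-induction sketch: repeatedly replace the most
    frequent repeated bigram with a fresh non-terminal (an offline
    simplification of Sequitur's digram-uniqueness loop; the
    rule-utility constraint is omitted)."""
    rules = {}
    s = list(sentence)
    next_id = 0
    while True:
        counts = {}
        for i in range(len(s) - 1):
            pair = (s[i], s[i + 1])
            counts[pair] = counts.get(pair, 0) + 1
        if not counts:
            break
        bigram, freq = max(counts.items(), key=lambda kv: kv[1])
        if freq < 2:
            break
        nt = "A%d" % next_id          # new symbol added to N
        next_id += 1
        rules[nt] = bigram            # rule in Chomsky Normal Form
        out, i = [], 0                # replace occurrences left to right
        while i < len(s):
            if i + 1 < len(s) and (s[i], s[i + 1]) == bigram:
                out.append(nt)
                i += 2
            else:
                out.append(s[i])
                i += 1
        s = out
    rules["S"] = tuple(s)
    return rules

def expand(rules, symbol="S"):
    """Regenerate the original string from the rule set (lossless check)."""
    if symbol not in rules:
        return symbol
    return "".join(expand(rules, x) for x in rules[symbol])
```

For "abcabcabc" the sketch yields rules such as A0 → ab plus a shortened start rule, and `expand` reproduces the input exactly, illustrating the lossless-compression property of the rule set.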

Mildly Context-Sensitive Rule Induction using the Quine-McCluskey Algorithm
In a decision-making event, choosing one of the possible alternative event outcomes is a logical function of a set of empirically verifiable contextual conditions. The analytical representation of a logical function is the disjunction of all conjunctions of conditions observed, in the empirical evidence about past event outcomes, to result in the outcome of interest. In a within-case study, Configuration Analysis identifies these conjunctions of conditions through an iterative procedure relying on several rounds of term minimization. The Quine-McCluskey (QMC) algorithm calculates all the essential prime implicants given a fixed, finite set of hypothesized contextual conditions. The measures of coverage and consistency allow the researcher to assess whether the inferred configurations of conditions explain all event outcomes; otherwise, some condition must be included or removed before running a new round of QMC. Researchers need specialized knowledge to interpret configurational patterns, explaining the observed heterogeneity with this inductive approach to infer the validity of the hypothesized set-theoretic relation between the event outcomes and all possible configurations of contextual conditions. Each observed conjunction of conditions is sufficient to imply an event outcome, but each condition alone is not necessary unless it occurs in all conjunctions. If a condition is necessary, then the set of instances with the outcome of interest is a subset of the set of instances with this condition; if a condition is sufficient, then the set of instances with this condition is a subset of those with the outcome. If different configurations of conditions imply the same event outcome, then there are multiple causal “recipes,” a feature also named equifinality. Nonetheless, these are not “causal paths” unless the order of occurrence of the conditions is under consideration. 
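A minimal sketch of QMC's first phase, prime-implicant generation, over bit-string truth-table rows, with '-' marking an eliminated condition; the second phase, selecting essential prime implicants via a coverage chart, is omitted. The function names and the three-condition example are illustrative, not part of the method described above.

```python
from itertools import combinations

def combine(a, b):
    """Merge two implicants that differ in exactly one specified bit."""
    diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diff) == 1 and a[diff[0]] != '-' and b[diff[0]] != '-':
        i = diff[0]
        return a[:i] + '-' + a[i + 1:]
    return None

def prime_implicants(minterms):
    """Repeat rounds of pairwise merging until no pair combines;
    implicants never absorbed into a merge are prime."""
    current = set(minterms)
    primes = set()
    while current:
        used, merged = set(), set()
        for a, b in combinations(sorted(current), 2):
            m = combine(a, b)
            if m is not None:
                merged.add(m)
                used.update((a, b))
        primes |= current - used
        current = merged
    return primes

# toy evidence: the outcome is observed whenever condition A holds,
# regardless of B and C (bit order: A, B, C)
rows = {"100", "101", "110", "111"}
```

Here `prime_implicants(rows)` reduces the four observed configurations to the single implicant `'1--'`, i.e., in this toy evidence condition A alone is sufficient for the outcome.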
Precisely here, Categorical-Generative Analysis addresses this weakness of the existing computer-supported configurational analysis approaches: it uses the nested stack of an indexed grammar to learn the exact order in which past event outcomes occurred. Each configuration of the order of occurrence of the conditions leading to a specific event outcome goes through QCA to determine whether the order matters after all. Given a positive result, the generative grammar of the process instance under analysis, once regarded as a non-deterministic context-free grammar, turns out to be mildly context-sensitive: the contradictory lines evidencing grammar ambiguity are eliminated by applying cs-QCA's configuration analysis to the contextual conditions in a within-case study. Nonetheless, the decision about the analytical generalization of the context-sensitivity of a subcategory of the social process to the theory still requires a cross-case study.

Cross-case Comparison of Formal Grammars
The candidate grammars that describe instances of a social process belong to one of the classes of the Chomsky hierarchy. The candidate model best representing a category of phenomenon is the one that explains all the empirical patterns synthetically: it minimizes the redundancy of the sequential pattern of social events, much as if it were a specific formal language. This methodology seeks to minimize the entropy within the empirical data collected from the structured text narrative (using a database of events). In the within-case analysis of sequences, the first fitting criterion is the minimization of entropy: identifying the repetitive sequence patterns of event outcomes that turn out to be grammar rules. The instances of all types of events, sorted together in chronological order, become the input to the sequential analysis of event outcomes; this relies on the assumption that they result from the same generative structure underlying the instance of the social system under inquiry. A parser for the formal grammar with the greatest computational complexity provides the model that minimizes the entropy in the instance's data set; it is inferred from the bulk of empirical data in the search for sequence patterns. Thus, the sequence analysis of the structured text narrative (in a database of events) identifies the recursion relations between the grammar rules describing the chain of event outcomes. The resulting grammar is in the context-free class of computational complexity.
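The entropy-minimization criterion can be made concrete with Shannon entropy over the event alphabet: rewriting a repetitive chain of event outcomes with grammar rules lowers the number of bits needed per symbol. A minimal sketch, with an illustrative event alphabet and a hand-picked rule:

```python
from collections import Counter
from math import log2

def entropy(seq):
    """Shannon entropy in bits per symbol of an event sequence."""
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * log2(c / n) for c in counts.values())

# a repetitive chain of event outcomes (illustrative alphabet)
events = list("abcabcabc")
# after inducing the rule A -> abc, the start rule is just A A A
rewritten = ["A", "A", "A"]

raw_bits = len(events) * entropy(events)         # cost of the raw chain
rule_bits = len(rewritten) * entropy(rewritten)  # cost of the start rule
```

The raw chain costs about 9 × log2(3) ≈ 14.3 bits, while the rewritten start rule costs 0 bits (a single repeated symbol), so the rule A → abc captures all the sequential structure; a full minimum-description-length account would also charge for storing the rule itself.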

The second fitting criterion in the within-case analysis of sequences of critical events is the minimization of stochasticity in the generative structure of the social process: substituting mildly context-sensitive rules for the alternative grammar rules departing from the same system state. The elimination of ambiguity due to an anomalous or surprising fact is always necessary; the reduction of uncertainty about the evolution of the social process due to the contextual conditions in a non-deterministic system state is also desirable, though non-mandatory, and the resulting gain in the predictability of the state transitions comes at a cost, of course. For each alternative state transition departing from the same system state, the researcher needs to find configurations of contextual conditions preceding it in the chain of event outcomes. Each configuration of contextual conditions implies an alternative context-sensitive rule for this system state; that is, the resulting formal grammar will be at least in the mildly context-sensitive class of complexity. It thereby highlights the depth of path dependence as well as the pattern of contingency in the behavior of this instance of the social process.
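Under the assumption that each observed transition is recorded as a (context, state, successor) triple, the substitution step can be sketched as grouping the alternative transitions of a non-deterministic state by the configuration of conditions preceding them; the state names and conditions below are hypothetical.

```python
def contextualize(transitions):
    """Group alternative transitions of a system state by the
    configuration of contextual conditions preceding them; a
    (state, context) key mapping to a single successor is a
    candidate mildly context-sensitive rule."""
    rules = {}
    for context, state, successor in transitions:
        rules.setdefault((state, context), set()).add(successor)
    return rules

# hypothetical evidence: state Q1 is non-deterministic until the
# contextual condition preceding each transition is taken into account
observed = [
    (("crisis",), "Q1", "Q2"),
    (("growth",), "Q1", "Q3"),
    (("crisis",), "Q1", "Q2"),
]
rules = contextualize(observed)
```

Here every (state, context) pair selects exactly one successor, so the ambiguity of Q1 disappears once the context is read; if some pair still mapped to several successors, the ambiguity would remain and further conditions would be needed.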

Finally, there is one fitting criterion to stop the cycles of cross-case analysis: the guarantee of representativeness, determining the number of case studies using the ordinary criteria of multiple-case study design. Because of the specificity arising from the mildly context-sensitive rules, different case studies should not be confronted with each other using sample-based, statistical methods. In fact, that research practice violates an epistemological assumption of Critical Realism about the contingent nature of causal explanations. However, the resulting grammars for the case studies relying on the same theoretical structure constitute implementations of the same concrete category of process, such that they may be submitted to cross-case analysis (i.e., the comparison of their derivation paths). Each pair of subsequent case studies must be assessed for the natural equivalence between their categorical structures after solving the divergences between theory and empirical evidence in a within-case study. This task provides the analytical generalization of the research conclusions to the theory itself.
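One rough proxy for assessing the natural equivalence of two categorical structures is to canonicalize non-terminal names in first-use order and compare the resulting rule sets: grammars differing only in symbol names then compare equal. A sketch under that assumption, representing rule sets as dicts from non-terminals to tuples of symbols:

```python
def canonical(rules, start="S"):
    """Rename non-terminals in first-use order from the start symbol,
    so grammars identical up to symbol names get equal canonical forms."""
    mapping = {}

    def visit(symbol):
        if symbol in rules and symbol not in mapping:
            mapping[symbol] = "N%d" % len(mapping)
            for child in rules[symbol]:
                visit(child)

    visit(start)
    return {mapping[head]: tuple(mapping.get(x, x) for x in body)
            for head, body in rules.items() if head in mapping}

# two case-study grammars differing only in non-terminal names
case_a = {"S": ("A", "A"), "A": ("a", "b")}
case_b = {"S": ("B", "B"), "B": ("a", "b")}
```

`canonical(case_a) == canonical(case_b)` holds, flagging the pair as structurally equivalent candidates. Note that language equivalence of context-free grammars is undecidable in general, so a check like this is a heuristic filter, not a proof of equivalence.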

Conversely, there is a cost-benefit trade-off when choosing the level of computational complexity: the higher the complexity of the resulting grammar, the higher the risk of misclassifying a structured text narrative or database of events as not belonging to this category of social process. Analogously to comparative studies, the greater, too, is the number of units of analysis needed to reach theoretical saturation.

Conclusions
In order to adopt a grammar as a model for scientific inquiry, the mathematical properties of the chosen formalism must respect the ontological and epistemological assumptions of the adopted scientific paradigm: this work aligns the assumptions of Critical Realism with the mathematical properties of Formal Language Theory. However, the researcher must still choose the grammar formalism adequate to the phenomenon under inquiry. The procedure for refining a grammar model grounded on empirical data replaces sets of alternative state transitions with their equivalent mildly context-sensitive rules. Alternative rules describe either equifinal paths or competing explanations for a category of process. On the one hand, equifinality results from the inherent complexity of emergent phenomena defined by incomplete theoretical statements, for example, due to cumulative effects; a set of equifinal rules generates distinct, mutually exclusive partitions of a domain of instances of a type of process. On the other hand, competing explanations result from the acknowledgement of contradictory facts; a set of ambiguous rules generates distinct, competing derivation paths for at least one instance of a type of process. In both cases, identifying configurations of contextual conditions associated with each alternative event outcome improves the predictability of the state transition, increasing the explanatory power of the model at the cost of increasing its computational complexity.

The epistemology guiding a methodology based on either Set Theory or Formal Language Theory is Critical Realism rather than Social Positivism. Consequently, it makes no sense to speak of rejecting a theoretical proposition, because what is in use is not a logic of refutation but a logic of retroduction. In addition, different learning algorithms make different assumptions about the abstract structure underlying the data set and have different rates of convergence. The one working best minimizes the cost function of interest (cross-validation error, for example); that is, it is the one whose assumptions are consistent with the data and that has sufficiently converged.
