Dimitris Dimitriadis

Language Processing R&D

Full Stack Web Developer

Dimitris Dimitriadis

Language Processing R&D

Full Stack Web Developer

Blog Post

Deterministic Finite State Automota for Regular Expressions

9 February 2020 Education
Deterministic Finite State Automota for Regular Expressions

A finite state automaton (FSA) is a computational model and is defined by the following 5 parameters:

  • Q: a finite set of N states \(q_0, q_1,…,q_N\)
  • Σ: a finite input alphabet of symbols
  • \(q_0\): the start state
  • F: the set of final states, \( F \subseteq Q \)
  • δ(q,i): the transition function or transition matrix between states. Given \(q \in Q\) and an input symbol \( i \in Σ \), δ(q,i) returns a new state \(q’ \in Q\). δ is thus a relation from \( Q \times Σ \to Q \);

Using this machine, one can recognize a formal language. So, if we define a formal language with a regular expression, an automaton will be in a position to recognize it.

In this post, we will work with two formal languages to show the algorithms that can be used for recognizing the languages by finite state automaton. We will implement the algorithms in python. You can find more information about the use of deterministic finite state automata here. Furthermore, in this post one has to know what we mean “tape” and “transition matrix/table”. For regular expressions, please visit here.

The Sheep Talks (character-level tapes)

Our first example is a simple formal language for the sheep language. The sheep language is any string of the form:

  • baa!
  • baaa!
  • baaaa!
  • ….

The corresponding regular expression is ^baa+!$. So, we want a finite state automaton to recognize this regular expression.

So, following the parameters defined above, \(Σ = \lbrace a,b,! \rbrace \) while for the states and transitions we will use a transition matrix. We define these elements in python

# sheep language
transition_matrix = [
                      # b,   a,   !
                       [1, None, None],   #q0
                       [None, 2, None],   #q1
                       [None, 3, None],   #q2
                       [None, 3, 4],      #q3
                       [None, None, None] #q4    
                    ]
symbols = ['b','a','!']

The transition matrix contains the transition between two states given a letter. For example, given the letter ‘b’ in the \(q_0\) state, we can pass to the state \(q_1\). In the \(q_1\) the FSA accepts the letter ‘a’ and goes to the state \(q_2\) and so on. The None values representes the invalid characters for a given state. So, if in state \(q_0\) the letter ‘a’ will be given, then the given string (or tape) is invalid. Next we define a function that will be necessary later:

def convert_char_to_collumn(symbol,symbols):
  for i in range(0,len(symbols)):
    if symbol == symbols[i]:
      return i

This function get as input the current symbol of a string (or tape) and the list of all available symbols. The aim of the function is to convert the letters to numeric values that corresponds to the collumns of the transition matrix. For example, if the given string is “baba!” the corresponding output will be “01012”. Next we define the algorithm that recognizes a given tape.

def d_recognize(tape,transition_matrix,symbols,accepted_states):
  index = 0
  current_state = 0
  while True:
    if index >= len(tape):
      if current_state in accepted_states:
        return 'accepted'
      else:
        return 'rejected'
    elif not transition_matrix[current_state][convert_char_to_collumn(tape[index],symbols)]:
      return 'rejected'
    else:
      current_state = transition_matrix[current_state][convert_char_to_collumn(tape[index],symbols)]
      index += 1

This function get as input the tape, the transition matrix the symbols and the accepted states (F set as defined above). The aim of the function is to pass through all letters and to decide if the given tape is valid. Particulalry, it uses the below conditions:

  • if we are at the end of the tape, then if the state is final, then the given tape is accepted, otherwise is not.
  • if we are not at the end of the tape, we check if the current letter in the current state can be considered as valid (i.e. not None). If the letter is not valid then the tape is rejected. Otherwise, we go to the next state based on the transition matrix and we update the index that points to the next letter in the tape.

To test our algorithm, we define some tapes and we call the above function

tapes = ['baaa!', 'baba', 'baaa', 'baaaaaaa!', 'ba!']
for tape in tapes:
  print (tape +':\t' + d_recognize(tape,transition_matrix,symbols,[4]))

A more complex example – dollars and cents (word-level tapes)

Now, suppose that we want to recognize money with our FSA. For example

  • twenty five dollars
  • five dollars thirty cents
  • five cents
  • ……

We will work as previously but here the transition matrix is bigger due to the set of symbols.

symbols = ['twenty','thirty','forty','fifty','sixty',
            'seventy','eigthy','ninety',
           'one','two','three','four','five',
           'six','seven','eight','nine','ten','eleven',
           'twelve','thirteen','fourteen',
           'fifteen','sixteen','seventeen','eighteen','ninenteen','cents',
           'dollars']

transition_matrix = []
length = len(symbols)
q0 = []
for i in range(0,length):
  if i < 8:
    q0.append(1)
  elif i >= 8 and i < length - 2:
    q0.append(2)
  else:
    q0.append(None)
transition_matrix.append(q0)
q1 = []
for i in range(0,length):
  if i < 8:
    q1.append(None)
  elif i >= 8 and i <length - 2:
    q1.append(2)
  elif i == length-1:
    q1.append(4)
  else:
    q1.append(3)
transition_matrix.append(q1)
q2 = []
for i in range(0,length):
  if i < length - 2:
    q2.append(None)
  elif i == length - 1:
    q2.append(4)
  else:
    q2.append(3)
transition_matrix.append(q2)
q3 = [None] * length  
transition_matrix.append(q3)
q4 = []
for i in range(0,length):
  if i < length -2:
    q4.append(5)
  else:
    q4.append(None)
transition_matrix.append(q4)
q5 = []
for i in range(0,length):
  if i < 8:
    q5.append(None)
  elif i >= 8 and i <length - 2:
    q5.append(6)
  elif i == length-1:
    q5.append(None)
  else:
    q5.append(7)
transition_matrix.append(q5)
q6 = []
for i in range(0,length):
  if i < length - 2:
    q6.append(None)
  elif i == length - 1:
    q6.append(None)
  else:
    q6.append(7)
transition_matrix.append(q6)
q7 =[None] * length
transition_matrix.append(q7)

Here the symbols are all the words that corresponds to the numeric values (we ignore cases with hundreads and thousands for simplicity). The transition matrix consists of 7 states, while the states that are full of Nones mean that they are final states.

Defining the bellow strings, we will see the expected results

string_1 = "five dollars cents"
string_2 = "twenty two cents"
print (d_recognize(string_1.split(),transition_matrix,symbols,[3,4,7]))
print (d_recognize(string_2.split(),transition_matrix,symbols,[3,4,7]))

Bibliography

  • Jurafsky, D. (2000). Speech & language processing. Pearson Education India.
Taggs:
Write a comment