Monday, December 14, 2009

Utility: Find similar strings using Levenshtein distance

This posting is basically to show the Levenshtein distance implementation in ABAP. This implementation was created to find if there is already a similar name of a given name. For example if you are searching for ‘John’ then ‘Johnny’ could also be presented in the search.

The Levenshtein distance gives the difference of characters between two strings. So in the case of ‘John’ and ‘Johnny’ the distance would be 2. The implementation is done using a class-method. The method accepts string1 and string2 and returns the distance. The program calling the method can then decided if the distance is acceptable or not.

image

METHOD get_levenshtein_distance.
  DATA lt_matrix TYPE REF TO data.
  DATA lst_matrix TYPE REF TO data.
  DATA l_str_len1 TYPE i.
  DATA l_str_len1_tmp TYPE i.
  DATA l_str_len2 TYPE i.
  DATA l_str_len2_tmp TYPE i.
  DATA l_count1 TYPE i.
  DATA l_count2 TYPE i.
  DATA l_i TYPE i.
  DATA l_j TYPE i.
  DATA l_i_1 TYPE i.
  DATA l_j_1 TYPE i.
  DATA l_field_value TYPE i.
  DATA lo_cl_abap_datadescr.
  DATA l_col_name TYPE string.
  DATA l_n_index(3) TYPE n.
  DATA lt_abap_component_tab TYPE abap_component_tab.
  DATA lst_abap_component TYPE abap_componentdescr.
  DATA lo_abap_datadescr TYPE REF TO cl_abap_datadescr.
  DATA lo_abap_structdescr TYPE REF TO cl_abap_structdescr.
  DATA lo_abap_tabledescr TYPE REF TO cl_abap_tabledescr.
  DATA lt_number TYPE TABLE OF i.
  FIELD-SYMBOLS <lt_matrix> TYPE table.
  FIELD-SYMBOLS <lst_matrix> TYPE data.
  FIELD-SYMBOLS <l_field> TYPE data.
  FIELD-SYMBOLS <l_field2> TYPE data.
* Create data type of number
  lo_abap_datadescr ?= cl_abap_datadescr=>describe_by_data( 1 ).
* Get the string lengths
  l_str_len1 = STRLEN( i_string1 ).
  l_str_len2 = STRLEN( i_string2 ).
* Create (l_str_len2 + 1) number of columns
  l_str_len2_tmp = l_str_len2 + 1.
  DO l_str_len2_tmp TIMES.
    l_n_index = sy-index.
    CONCATENATE 'COL' l_n_index INTO l_col_name.
    lst_abap_component-name = l_col_name.
    lst_abap_component-type = lo_abap_datadescr.
    APPEND lst_abap_component TO lt_abap_component_tab.
  ENDDO.
  lo_abap_structdescr = cl_abap_structdescr=>create( p_components = lt_abap_component_tab ).
  lo_abap_tabledescr = cl_abap_tabledescr=>create( p_line_type = lo_abap_structdescr ).
  CREATE DATA lt_matrix TYPE HANDLE lo_abap_tabledescr.
  ASSIGN lt_matrix->* TO <lt_matrix>.
  CREATE DATA lst_matrix TYPE HANDLE lo_abap_structdescr.
  ASSIGN lst_matrix->* TO <lst_matrix>.
* Initialise first row
  WHILE sy-subrc IS INITIAL.
    ASSIGN COMPONENT sy-index OF STRUCTURE <lst_matrix> TO <l_field>.
    IF sy-subrc IS NOT INITIAL.
      EXIT.
    ENDIF.
    <l_field> = sy-index - 1.
  ENDWHILE.
  APPEND <lst_matrix> TO <lt_matrix>.
* Create (l_str_len1 + 1) number of rows
* First row is already created
  DO l_str_len1 TIMES.
    CLEAR <lst_matrix>.
    ASSIGN COMPONENT 1 OF STRUCTURE <lst_matrix> TO <l_field>.
    <l_field> = sy-index.
    APPEND <lst_matrix> TO <lt_matrix>.
  ENDDO.
  l_j = 2. " Column
  DO l_str_len2 TIMES.
    CLEAR l_count1.
    l_i = 2. " Row
    DO l_str_len1 TIMES.
* i - 1
      l_i_1 = l_i - 1.
* j - 1
      l_j_1 = l_j - 1.
* Get d[i, j]
      READ TABLE <lt_matrix> INDEX l_i ASSIGNING <lst_matrix>.
      ASSIGN COMPONENT l_j OF STRUCTURE <lst_matrix> TO <l_field2>.
      IF i_string1+l_count1(1) = i_string2+l_count2(1).
* Get d[i-1, j-1]
        READ TABLE <lt_matrix> INDEX l_i_1 ASSIGNING <lst_matrix>.
        ASSIGN COMPONENT l_j_1 OF STRUCTURE <lst_matrix> TO <l_field>.
*         d[i, j] := d[i-1, j-1]
        <l_field2> = <l_field>.
      ELSE.
        CLEAR lt_number[].
*                      d[i-1, j] + 1,  // deletion
        UNASSIGN <l_field>.
        READ TABLE <lt_matrix> INDEX l_i_1 ASSIGNING <lst_matrix>.
        ASSIGN COMPONENT l_j OF STRUCTURE <lst_matrix> TO <l_field>.
        l_field_value = <l_field> + 1.
        APPEND l_field_value TO lt_number.
*                      d[i, j-1] + 1,  // insertion
        UNASSIGN <l_field>.
        READ TABLE <lt_matrix> INDEX l_i ASSIGNING <lst_matrix>.
        ASSIGN COMPONENT l_j_1 OF STRUCTURE <lst_matrix> TO <l_field>.
        l_field_value = <l_field> + 1.
        APPEND l_field_value TO lt_number.
*                      d[i-1, j-1] + 1 // substitution
        UNASSIGN <l_field>.
        READ TABLE <lt_matrix> INDEX l_i_1 ASSIGNING <lst_matrix>.
        ASSIGN COMPONENT l_j_1 OF STRUCTURE <lst_matrix> TO <l_field>.
        l_field_value = <l_field> + 1.
        APPEND l_field_value TO lt_number.
* Get the minimum
        UNASSIGN <l_field>.
        SORT lt_number.
        READ TABLE lt_number INDEX 1 ASSIGNING <l_field>.
*         d[i, j] := minimum
        <l_field2> = <l_field>.
      ENDIF.
      l_i = l_i + 1.
      l_count1 = l_count1 + 1.
    ENDDO.
    l_j = l_j + 1.
    l_count2 = l_count2 + 1.
  ENDDO.
* Bottom right of the array lt_matrix will have the answer
  l_str_len1_tmp = l_str_len1 + 1.
  l_str_len2_tmp = l_str_len2 + 1.
  READ TABLE <lt_matrix> INDEX l_str_len1_tmp ASSIGNING <lst_matrix>.
  ASSIGN COMPONENT l_str_len2_tmp OF STRUCTURE <lst_matrix> TO <l_field>.
  r_distance = <l_field>.
ENDMETHOD.

1 comment:

Anonymous said...

Worked perfectly. When two companies are merged: we used this to identify common customers and common vendors.