Use of linkage to improve the completeness of the SIM and SINASC in the Brazilian capitals

ABSTRACT OBJECTIVE To analyze the contribution of linkage between databases of live births and infant mortality to improve the completeness of the variables common to the Mortality Information System (SIM) and the Live Birth Information System (SINASC) in Brazilian capitals in 2012. METHODS We studied 9,001 deaths of children under one year registered in the SIM in 2012 and 1,424,691 live births present in the SINASC in 2011 and 2012. The databases were related with linkage in two steps – deterministic and probabilistic. We calculated the percentage of incompleteness of the variables common to the SIM and SINASC before and after using the technique. RESULTS We could relate 90.8% of the deaths to their respective declarations of live birth, most of them paired deterministically. We found a higher percentage of pairs in Porto Alegre, Curitiba, and Campo Grande. In the capitals of the North region, the average of pairs was 84.2%; in the South region, this result reached 97.9%. The 11 variables common to the SIM and SINASC had 11,278 incomplete fields cumulatively, and we could recover 91.4% of the data after linkage. Before linkage, five variables presented excellent completeness in the SINASC in all Brazilian capitals, but only one variable had the same status in the SIM. After applying this technique, all 11 variables of the SINASC became excellent, while this occurred in seven variables of the SIM. The city of birth was significantly associated with the death component in the quality of the information. CONCLUSIONS Despite advances in the coverage and quality of the SIM and SINASC, problems in the completeness of the variables can still be identified, especially in the SIM. In this perspective, linkage can be used to qualify important information for the analysis of infant mortality.


INTRODUCTION
Child mortality is an important indicator of maternal and child health and is considered a sentinel event because it can be avoided when adequate living conditions and access to quality care are provided for pregnant women, during delivery, and for newborns 1 .
Despite contemporary advances, infant mortality persists as a public health problem in the world, especially in poorer countries and regions, which has led the United Nations to include the reduction of child mortality by two-thirds as one of the eight Millennium Development Goals 1 .
The monitoring and evaluation of the compliance with the millennium goal of reducing infant mortality implies the availability and adequacy of health information, which is still a challenge to be faced by the health sector in the Americas 2 . In Brazil, the Mortality Information System (SIM) and the Live Birth Information System (SINASC) have been developed by the Ministry of Health in view of the need to know the epidemiological situation of deaths and births in the country 3 .
By identifying limitations in the coverage and quality of the information produced by these systems 4,5 , the Ministry of Health made investments that resulted in a marked improvement in the SIM and SINASC in relation to both the coverage and quality of their data over the last two decades 3,6-8 . However, this coverage is heterogeneous in the country, with large variations among the states and some with low percentages, particularly those located in the North and Northeast regions 9 .
The analysis of the health situation and the planning of actions to reduce child mortality depend on the availability of information of adequate quality. The access to reliable data allows the identification of the conditions of births and deaths and their determinants with greater validity 10 .
Several aspects, such as the completeness of variables, reliability, and consistency, need to be analyzed in the evaluation of the quality of vital statistics 11 . Underreporting and the presence of unknown or unfilled variables also compromise the reliability of the data and, consequently, the use of real data on infant mortality in Brazil 12 .
In this perspective, we highlight that the analysis of completeness is an important dimension of the evaluation of the quality of the information, exposing the lack of care and importance given to the filling by health professionals, the absence of data in medical records, and even the lack of knowledge on certain information by the woman's or the child's companions 13 . These deficiencies are also the result of the access to health services, diagnostic technologies, and the ability of the medical professional to recognize the dynamics of the events that participated in the causal chain of death, as well as their relationship with the production of reliable statistics 14 .
Several authors have used linkage as a strategy to improve the quality of information, since this procedure allows the recovery of incomplete or inconsistent records, thus improving the completeness and reliability of the information provided by the SINASC and SIM 15-17 . In addition, linkage has low operational cost and is easy to manage, which facilitates a more qualified analysis of the health situation and the monitoring of the prevalence of risk factors and their magnitude in the population of live births, also facilitating the planning and evaluation of maternal and child health 18,19 .
In this perspective, this study aimed to analyze the contribution of linkage between databases of live births and infant mortality to improve the completeness of the variables common to the SIM and the SINASC in Brazilian capitals in 2012.

METHODS
This is an observational, cross-sectional study carried out in the 26 Brazilian capitals and the Federal District, in which we analyzed the information related to 9,001 deaths of children under one year registered in the SIM in 2012 and 1,424,691 live births present in the SINASC in 2011 and 2012. The data entry documents for these systems are the declaration of death (DD) and the declaration of live birth (DLB).
We used the linkage technique as a methodological tool, applying the deterministic and probabilistic methods.
The first step of the linkage (deterministic) was based on the unifying variable common to both systems, the DLB number. For this purpose, we used one of the lookup and reference functions (PROCV, the Portuguese-language name of VLOOKUP) provided in the software Microsoft Office Excel 2007.
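As a rough illustration of this deterministic step, the lookup on the DLB number can be sketched in Python instead of a spreadsheet. This is a minimal sketch, not the procedure actually run in the study: the record layout and the field names (dlb_number, death_id, birth_weight) are hypothetical and do not reproduce the SIM or SINASC schemas.

```python
def deterministic_link(deaths, births):
    """Pair each death record with the birth record sharing its DLB number."""
    # Index births by DLB number; this plays the role of the lookup table.
    births_by_dlb = {b["dlb_number"]: b for b in births if b.get("dlb_number")}
    pairs, unpaired = [], []
    for death in deaths:
        key = death.get("dlb_number")
        birth = births_by_dlb.get(key) if key else None
        if birth is not None:
            pairs.append((death, birth))
        else:
            # No DLB number, or no exact match: left for the probabilistic step.
            unpaired.append(death)
    return pairs, unpaired


# Hypothetical toy records (not real data).
births = [
    {"dlb_number": "A001", "birth_weight": 3100},
    {"dlb_number": "A002", "birth_weight": 2450},
]
deaths = [
    {"death_id": 1, "dlb_number": "A002"},
    {"death_id": 2, "dlb_number": None},
    {"death_id": 3, "dlb_number": "Z999"},
]
pairs, unpaired = deterministic_link(deaths, births)
```

In this toy example, only the first death is paired; the other two would go on to the probabilistic step.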
For unpaired records at this step, we used probabilistic linkage using automated routines (standardization, linkage, and combination of files), based on common fields present in both databases, in order to identify the probability of a pair of records belonging to the same individual.
We applied the probabilistic method using the multi-step strategy, associated with a manual review of the doubtful pairs, seeking to classify them as true pairs or non-pairs. The blocking fields used were the soundex of the mother's first name, the soundex of the mother's last name, the child's gender, and the mother's age. For comparison, we used the mother's name and the child's birth date. The variables used as decision criteria during the manual inspection of pairs were the residence address, neighborhood of residence, complement of the residence address, and age of the mother, as well as the date of death and year of birth of the child.
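To make the idea of blocking concrete, the sketch below builds a blocking key from the four fields listed above and generates candidate pairs only within matching blocks. It is an illustration under stated assumptions: it uses the standard American Soundex, which may differ from the phonetic coding implemented in Reclink III; it combines all four fields in a single key, whereas a multi-step strategy would repeat the process with different keys; and the field names (mother_name, child_sex, mother_age) are hypothetical.

```python
from collections import defaultdict

# Standard American Soundex digit groups.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Four-character American Soundex code; non A-Z characters are dropped."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate letters with equal codes
            prev = digit
    return (code + "000")[:4]


def blocking_key(rec):
    """Blocking key; assumes mother_name holds at least one word."""
    parts = rec["mother_name"].split()
    return (soundex(parts[0]), soundex(parts[-1]),
            rec["child_sex"], rec["mother_age"])


def candidate_pairs(deaths, births):
    """Return (death, birth) pairs whose blocking keys are identical."""
    blocks = defaultdict(list)
    for birth in births:
        blocks[blocking_key(birth)].append(birth)
    return [(d, b) for d in deaths for b in blocks.get(blocking_key(d), [])]


# Hypothetical toy records: a spelling variant of the same name shares a block.
deaths = [{"mother_name": "Maria Souza", "child_sex": "F", "mother_age": 24}]
births = [
    {"mother_name": "Mariah Sousa", "child_sex": "F", "mother_age": 24},
    {"mother_name": "Ana Lima", "child_sex": "F", "mother_age": 24},
]
candidates = candidate_pairs(deaths, births)
```

Blocking restricts comparisons to records that already agree on the key, so the comparison fields (mother's name and child's birth date) are evaluated only for a small set of candidates.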
All probabilistic step processing was performed using the software Reclink III version 3.0.4.4005, a free program that allowed us to associate files based on the probabilistic linkage of records. At the end of this process, the files from the deterministic and probabilistic steps were unified, followed by the filling of the incomplete fields in the SIM and SINASC.
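The filling of incomplete fields after pairing can be pictured as follows. This is only a sketch of the general idea, assuming that a field missing in one record is copied from its linked counterpart when the counterpart has a value; the missing-value markers and field names are hypothetical, and the study does not describe its filling routine at this level of detail.

```python
# Hypothetical markers for unfilled or unknown fields.
MISSING = (None, "", "unknown")


def fill_from_pair(death, birth, common_fields):
    """Fill missing common fields in each record from its linked counterpart.

    Returns the number of fields recovered across both records.
    """
    recovered = 0
    for field in common_fields:
        death_missing = death.get(field) in MISSING
        birth_missing = birth.get(field) in MISSING
        if death_missing and not birth_missing:
            death[field] = birth[field]
            recovered += 1
        elif birth_missing and not death_missing:
            birth[field] = death[field]
            recovered += 1
        # If both are missing, the field stays incomplete.
    return recovered


# Hypothetical linked pair (not real data).
death_rec = {"birth_weight": None, "type_of_delivery": "vaginal", "race": "unknown"}
birth_rec = {"birth_weight": 2450, "type_of_delivery": "", "race": "unknown"}
n_recovered = fill_from_pair(
    death_rec, birth_rec, ["birth_weight", "type_of_delivery", "race"]
)
```

Here two fields are recovered (one in each record), while race remains incomplete because it is unknown on both sides.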
We selected the variables common to the two databases to analyze the percentage of completeness, namely: sex and race of the child, age of the mother, education level of the mother, occupation of the mother, number of children born alive, number of stillbirths, type of pregnancy, length of pregnancy, birth weight, and type of delivery.
We defined data as incomplete when fields were not filled in or when they were marked as unknown. For each variable common to the systems, we analyzed the filling before and after linkage, categorizing them according to the criteria of incompleteness proposed by Romero and Cunha 12 : excellent (< 5%), good (5% to 9.9%), regular (10% to 19.9%), poor (20% to 49.9%), and very poor (≥ 50%).
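The incompleteness thresholds above translate directly into a small classification function. The sketch below follows the cut-offs quoted in the text; the helper that computes the percentage and the markers used for missing values are hypothetical.

```python
def incompleteness_pct(values, missing=(None, "", "unknown")):
    """Percentage of values that are unfilled or marked as unknown."""
    values = list(values)
    if not values:
        raise ValueError("no values to evaluate")
    n_missing = sum(1 for v in values if v in missing)
    return 100.0 * n_missing / len(values)


def classify_incompleteness(pct):
    """Category of incompleteness according to the thresholds in the text."""
    if pct < 5:
        return "excellent"
    if pct < 10:
        return "good"
    if pct < 20:
        return "regular"
    if pct < 50:
        return "poor"
    return "very poor"


# Hypothetical column with 2 missing values out of 8 (25%).
example_pct = incompleteness_pct([3100, None, 2450, "unknown", 2900, 3300, 1800, 2700])
example_class = classify_incompleteness(example_pct)
```

For instance, an incompleteness of 21.7% falls in the poor category, and 94.5% in the very poor category.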
In addition to the completeness aspect of the variables, considering the assumption that the better the quality of the information the greater the chance of linkage success, we analyzed the association between the pairing of the SIM and SINASC, the capital of residence of the child, and the component of infant mortality. This last one expresses the subgroups of the age of death of children under one year, classified as neonatal (from zero to 27 days of life) or post-neonatal (from 28 days to one year of life).
To this end, we considered as cases the infant deaths not paired to their respective DLB and as controls those whose birth and death records were concatenated. We calculated odds ratios (OR) to verify the association between non-pairing (outcome) between systems and independent variables (capital and child death component). We evaluated the significance of the differences between the proportion of pairs and non-pairs from the linkage of the databases using the chi-square test, with statistical significance (p < 0.05).
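The calculation of an odds ratio and a chi-square test for a 2x2 table can be sketched as below. The counts are invented for illustration and are not results of this study. The sketch assumes Pearson's chi-square without continuity correction (the text does not state which variant was used) and obtains the p-value for one degree of freedom from the complementary error function.

```python
import math


def odds_ratio(a, b, c, d):
    """OR for a 2x2 table: a, b = exposed cases, exposed controls;
    c, d = unexposed cases, unexposed controls. Cells must be non-zero."""
    return (a * d) / (b * c)


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) and p-value, 1 df."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Invented counts: cases = non-paired deaths, controls = paired deaths,
# exposure = post-neonatal component.
a, b, c, d = 30, 70, 20, 180
or_value = odds_ratio(a, b, c, d)
chi2, p_value = chi_square_2x2(a, b, c, d)
significant = p_value < 0.05
```

An OR above 1.00 with p < 0.05 would indicate a greater chance of non-pairing in the exposed group; an OR below 1.00 would indicate a lower chance.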
This study used secondary data from the SIM and SINASC provided by the Ministry of Health following a technical opinion from the Department of Health Surveillance (SVS/MS), a term of assignment and use of databases, and the signing of the statement of responsibility by the researchers. This work is part of the research named "Determinantes da mortalidade infantil nas capitais brasileiras: Uma análise multinível nos contextos individual, da assistência à saúde e socioeconômico", conducted within the standards of scientific ethics and approved by the Research Ethics Committee (CAAE record 35632414.5.0000.5190).

RESULTS
Of the 9,001 deaths of children under one year registered in the Brazilian capitals in 2012, we could relate 90.8% of the DD to their respective DLB (Table 1).
Porto Alegre, Curitiba, and Campo Grande presented the highest percentage of paired deaths. In Vitória and Recife, the proportion of associated records was also high, exceeding 97%. On the other hand, the lowest percentages of linkage success were observed in Belém, Brasília, Boa Vista, and Natal. On average, the capitals of the South and Southeast regions presented better results in the linkage of the databases, while this result was lower in the capitals of the North region (Table 1).
Despite the variations in the linkage results between cities and regions, the success rate of database linkage was over 80% in all analysis units (Table 1).
In relation to the type of linkage, most pairs were obtained with the deterministic method (69.2%), while probabilistic linkage resulted in 21.7% of the concatenated records (Table 1).
In 22 capitals, we observed a greater contribution of the deterministic method for the pairing of the death and birth data. In cities such as Porto Alegre, Curitiba, Campo Grande, São Paulo, and Aracaju, the linkage results exceeded 90% of pairing, and the first three cities reached more than 98% (Table 1).
On the other hand, in Rio Branco, Palmas, Belo Horizonte, Maceió, and Goiânia, paired records were predominant with the use of probabilistic linkage. We highlight Rio Branco, in which 92.4% of the deaths were related probabilistically. In Campo Grande, Porto Alegre, and São Paulo, this method had no contribution (Table 1).
On average, the deterministic linkage predominated in all regions, most markedly in the South region. In the Northeast region, the probabilistic method contributed 30.3% of the pairs, followed by the North region, with 26.8% (Table 1).
In the analysis of completeness of the 11 variables common to the analyzed systems, we found 11,278 unknown or unfilled fields in total, of which 9,016 (79.9%) were in the SIM and 2,262 (20.1%) in the SINASC. After linkage, we could recover 10,307 fields, leaving only 971 incomplete fields, which is equivalent to a reduction of 91.4% in incompleteness. This increase in the completeness of the information was more pronounced for the SIM, which went from an average of 10% of incompleteness to 1.1%.
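The figures in this paragraph can be checked with a few lines of arithmetic. The snippet below uses only the counts reported in the text; the variable names are arbitrary.

```python
total_incomplete = 11_278      # unknown or unfilled fields before linkage
sim_incomplete = 9_016         # of which in the SIM
sinasc_incomplete = 2_262      # of which in the SINASC
remaining_incomplete = 971     # incomplete fields left after linkage

recovered = total_incomplete - remaining_incomplete
share_sim = round(100 * sim_incomplete / total_incomplete, 1)
share_sinasc = round(100 * sinasc_incomplete / total_incomplete, 1)
reduction_pct = round(100 * recovered / total_incomplete, 1)
```

The two system totals add up to the overall count, and the recovered share of 10,307 out of 11,278 fields corresponds to the 91.4% reduction reported.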
Moreover, in the total, regarding the categorization of the completeness of the variables studied before pairing in the SIM, we observed one excellent, six good, three regular, and one poor variable. For SINASC, eight of them were excellent and three were good. After linkage, all became excellent (Figure).
Among these variables, the sex of the child stands out, with excellent completeness in both systems. The occupation of the mother had the poorest quality, with 21.7% of the information being unknown in the SIM, followed by length of pregnancy, education level of the mother, and race. In the SINASC, length of pregnancy, occupation of the mother, and race also presented lower percentages of completeness. Nevertheless, the proportion of unfilled fields did not exceed 10% in any of them (Figure).
In the SIM, the variable of the sex of the child had excellent completion in all capitals even before linkage. The number of children born alive was also classified as excellent in 17 cities, among them Porto Alegre, Recife, Rio Branco, São Paulo, and Campo Grande, with less than 1% of unknown data (Table 2).
The variables of occupation of the mother, length of pregnancy, education level of the mother, and race were categorized as regular, poor, or very poor in more than half of the cities studied. The occupation of the mother was classified as very poor in four capitals (Salvador, Goiânia, Natal, and Porto Velho) and as poor in nine other ones. The length of pregnancy presented 94.5% of incompleteness in Rio Branco and between 20% and 40% in 11 capitals. The education level of the mother had 57.1% of missing data in Porto Velho, and it was identified as poor in eight other cities. The variable of race had poor completeness in six cities, among which Fortaleza stands out, with 40.2% (Table 2).
After the linkage of the databases, six variables became excellent in all cities in the SIM, namely: age and education level of the mother, number of children born alive, type of pregnancy, type of delivery, and birth weight. The variable of number of stillbirths was excellent in 26 capitals, except in Palmas. The variable of race in 24 cities and the occupation of the mother and length of pregnancy in 20 cities also became excellent. No variables were categorized as poor or very poor in the cities studied after linkage (Table 2).
Regarding the SINASC, five variables were considered as excellent before the linkage of the databases in all capitals: sex of the child, age of the mother, type of pregnancy, type of delivery, and birth weight. The education level of the mother and the number of children born alive also had excellent completeness in 25 cities, and the number of stillbirths in 24 capitals (Table 3).
Race was the poorest variable in the highest number of cities, among which we can highlight São Luís (where incompleteness reached 36.8%), Fortaleza, Brasília, Teresina, and Rio Branco. Length of pregnancy was categorized as poor in São Luís, Teresina, and Porto Velho, and as regular in six other cities. The occupation of the mother had poor completeness in Belo Horizonte and Natal, and regular in six cities (Table 3).
After linkage, seven variables were identified with excellent completeness in all cities in the SINASC, with the exception of race, occupation of the mother, number of stillbirths, and length of pregnancy. Race went from poor to regular in São Luís and Fortaleza. Another variable with regular status was the occupation of the mother, thus classified in Natal, Macapá, and Salvador. Length of pregnancy remained as 15.1% of incompleteness in Rio Branco and went to 10.6% in Teresina. At the end of the linkage of the databases, no variables were categorized as poor or very poor in the capitals (Table 3).
When analyzing the completeness of the variables according to capitals, two cities presented all the variables categorized as excellent for both the SIM and SINASC, even before linkage (Campo Grande and Cuiabá). In addition, four other capitals had 11 excellent variables in the SINASC before the linkage of the databases, which went to 14 in both systems after pairing (Tables 2 and 3).
Campo Grande, Cuiabá, Porto Alegre, Recife, Vitória, and São Paulo were the capitals with the best quality in completeness of information in the SIM, before linkage, with more than nine variables classified as excellent. In the SINASC, 22 cities presented more than 80% of excellent variables, and we highlight Campo Grande, Cuiabá, Porto Alegre, Recife, João Pessoa, and Boa Vista, as they presented all variables in this category (Tables 2 and 3).
Salvador, São Luís, Fortaleza, Rio Branco, Macapá, Natal, and Teresina, capitals of the North and Northeast regions, had the highest percentage of incompleteness of information, even after linkage (Tables 2 and 3).
We identified a significant association between the city of birth of the child, the component of infant mortality, and the success of the linkage between databases (Table 4).
Taking all capitals as reference, we can observe that this association presented statistical significance (p < 0.05) in 15 cities. In seven of them, there was a greater chance of non-pairing between the SIM and SINASC, among which we highlight Belém, Brasília, and Natal, with the highest values of OR. On the other hand, eight capitals had a significantly lower chance of non-linkage of the records, with values of OR lower than 1.00, especially Campo Grande, Porto Alegre, Curitiba, and Recife (Table 4).
In the aggregate analysis of capitals by macro-region, the North and Midwest regions had greater probability of non-pairing. In the North region, four of the seven capitals had greater probability of non-linkage between databases. In the South and Southeast regions, we identified lower chances of linkage failure, with values of OR lower than 1.00 with statistical significance. This association was not significant only for the Northeast, being close to the average of the capitals (Table 4).
Regarding the death component, we found that, overall, deaths in the post-neonatal period had a greater chance of non-pairing of the death certificate records in relation to the respective DLB, when compared to neonatal deaths (Table 4).

DISCUSSION
The high percentage of linkage success between the SIM and SINASC for the Brazilian capitals as a whole indicates the quality of the information of these systems. However, we observed regional differences, with better results in the capitals of the South and Southeast regions, while the capitals of the North region had lower percentages of pairs. These results corroborate the findings of other studies 15-17 and were superior when compared to previous research studies, whose linkage of birth and death databases resulted in percentages of pairs ranging from 40% to approximately 70% 20-22 .
We also highlight the predominance of the deterministic method to obtain paired records in 22 capitals. This result is related to the recording of the DLB number, a unique identifier common to the SIM and SINASC whose completion is mandatory for the deaths of children under one year. In cities such as Porto Alegre, Curitiba, and Campo Grande, the deterministic linkage results reached more than 98%. In contrast, we could not deterministically retrieve any records for Rio Branco, and the data were only linked using the probabilistic method, as the city presents a substantial gap in the completeness of the DLB number in the SIM.
Similar results were observed in a study that has analyzed the contribution of linkage between the SIM and SINASC in five Brazilian cities, showing the predominance of the deterministic method 13 . Mendes et al. 14 also discuss the relationship between the type of linkage and the quality of the information, verifying that, the larger the municipality, the more the records are paired deterministically and the smaller the occurrence of non-pairs; conversely, the smaller the municipality, the greater the contribution of the probabilistic method and the larger number of non-paired records.
The evaluation of completeness allows us to measure the frequency of information that is "unknown" or "unfilled". Unknown variables are the product of a series of deficiencies, such as lack of information in the medical records and lack of knowledge of certain information by the woman's companions, while unfilled variables are a reflection of the lack of care and importance given by the responsible professional to the filling of the information 22 .
Among the results of this study, regarding the analysis of completeness of the 11 variables common to the SIM and SINASC, we highlight the significant number of unknown or unfilled fields retrieved after linkage. For all 27 capitals, this contribution was greater for the SIM, which presented only one excellent variable before linkage, and all variables became excellent after the application of this technique.
Research studies that have used linkage between these information systems also suggest a larger contribution of the SINASC to the SIM 15-17 . This is because, in general, the SINASC still presents superior quality to the death data registered in the SIM, both in terms of coverage and completeness and reliability of the information 3,7,8,14 .
The SINASC presented, even before linkage, good to excellent completion for most variables.
A study that has analyzed the completeness of information in the DLB and the report of early neonatal and fetal death in the region of Ribeirão Preto, State of São Paulo, Brazil, has observed less than 10% of DLB with unfilled fields during the period from 2000 to 2007, and it also has detected an increasing trend in the quality of the filling 18 . These findings also resemble the research studies carried out in the Northeast region and in Pernambuco 11,23 .
In relation to the SIM, this study found high percentages of incompleteness. Only the variable of sex of the child was considered as excellent before linkage in the aggregate analysis of capitals. These deficiencies in the filling of important variables result in limitations in the potential use of the system for epidemiological studies 24 .
Among the variables studied, we highlight that the occupation of the mother, education level of the mother, race of the child, and length of pregnancy had lower filling quality, being categorized as regular, poor, or very poor in more than half of the Brazilian capitals, with a more significant percentage of incompleteness in the SIM. These findings corroborate what has been found in the literature 6,11,15,18 .
Maternal variables such as education level, occupation, and age, as well as race of the child, are considered important indicators of the socioeconomic conditions of the woman.
In addition to problems related to the methodological clarity of the instructions for collecting and completing these fields, Romero and Cunha 10 also suggest a correlation between completeness and the indicators of poverty, economic inequality, and human resources in health. In addition, the omission of data on these variables hinders studies on social disparities and infant mortality 6 .
We also highlight the high percentage of incompleteness of data on length of pregnancy in the SIM, which is an important predictor of infant mortality 19-21 . This variable was categorized as poor in 11 capitals studied, and we highlight Rio Branco, where the proportion of unfilled fields reached 94.5%. The application of linkage allowed us to retrieve information, and only two cities had regular completeness, while the other capitals reached the status of excellent or good.
The number of children born alive and the number of stillbirths, variables related to maternal parity, had good completeness in the SIM and excellent completeness in the SINASC for all capitals. These results differ from the research that has evaluated the quality of the information in the SINASC in the States of Brazil, showing that the parity variables were among those that showed greater incompleteness and lower consistency 12 .
Information related to the newborn, such as sex and birth weight, as well as the type of pregnancy and type of delivery, presented a very low frequency of unknown information for both the SIM and SINASC, corroborating the findings of other studies 11,12,15,22 .
Deaths in the post-neonatal period had a greater chance of non-pairing between the SIM and SINASC. This finding reinforces the precept that the investigation of infant mortality is more timely and robust the closer the birth and death events. In addition, we highlight the high occurrence of neonatal deaths in the hospital setting, which allows the better filling of information 13,14 .
We also identified differences among the Brazilian capitals in the quality of information, both in the completeness of the variables and in the results of the linkage between databases, although with heterogeneous realities within the regions.
In the capitals of the North and Northeast regions of Brazil, such as Salvador, São Luís, Fortaleza, Rio Branco, Macapá, Natal, and Teresina, we found the highest percentage of incompleteness of information, even after pairing. In addition, the North and Midwest regions presented a lower chance of linkage success in relation to the other capitals, with statistical significance.
The quality of the records is related to the human and technological development conditions of each region 24 , and it is considered a way of expressing inequities in the health care of more vulnerable groups, particularly the barriers to access to services 25 . It also reveals obstacles in the generation and consolidation of health information, such as: little importance given to the adequate filling of information by physicians and administrative staff, problems related to clarity in the instructions provided by the Ministry of Health, failures in typing the data in the system, and deficiencies in the corrections of the SIM database after the death investigation process 6,12,23,26 .
Despite advances in the coverage and quality of birth and death information systems in recent decades, the results of this study reinforce the existence of limitations. These problems were identified especially in the SIM and they are related to the filling of the death certificate and the completeness of the variables, which restrict the research on the risk factors for infant mortality.
This study indicated the linkage method between the SIM and SINASC for the qualification of databases, enabling the recovery of unfilled or unknown information of important variables for the analysis of infant mortality.
Given its low operational cost and ease of execution, we recommend the incorporation of linkage allied to the surveillance of infant mortality in the routine management of the SUS in its different spheres to improve information on health, understood as a strategic element for the analysis of the health situation and consequent decision making.
This study also suggests the need for investigations regarding other aspects of the quality of vital statistics, such as the presence of filling errors, identification of inconsistencies, and analysis of agreement between records of the same individual present in the different information systems.