Statistics Finland
 

Session No. 1

Paper No. 8

Country: Italy

Progress Report

Giuseppe Garofalo

ISTAT

 

The ASIA project

(Setting-up of the Italian Business Register)

Synthesis of the methodological manual

 

 

Helsinki, September 1998

 

 

This paper is a synthesis of the methodological manual of the ASIA project (Italian Statistical Business Register). Its sections present the main conceptual aspects, the production process and the statistical methodologies adopted for data input, estimation and checking. The document also contains some tables quantifying the results obtained. To prepare this synthesis, documents and publications drawn up by several researchers involved in the project have been adopted, integrated and re-elaborated. Specific references are given in a special bibliographic note.

The synthesis has been edited by Giuseppe Garofalo; the translation into English was done by Maria Fustaci.

 

 

1. Introduction

Knowledge of the universe of enterprises in a territory is important information for the decisions of operators and for the analysis of the characteristics of a country's productive development and its evolution over time. In Italy this universe was known only every 10 years, on the occasion of the general censuses, and with a delay of a few years (3 for the last census, of 1991) with respect to the reference date, because of the time technically needed to process the survey forms. Beyond the considerable problems of cost and organization, census surveys suffer, because of the "door to door" technique adopted, from reduced coverage of a few sectors of activity (free professionals, trade intermediaries, building, and transport).

The limits of the census survey, and continuous technological innovation, which shortens the time in which an industrial system changes structurally, have stimulated the demand for defining and setting up a new instrument at the centre and at the service of an integrated system of economic information: a unique, complete and updated statistical business register.

To be economically feasible, the planning and construction of a complete and updated statistical register have to exploit, in a synergic process, all the information on enterprises that is already stored by institutions or public administrations. As a matter of fact, enterprises systematically and frequently produce administrative acts during their lives: they pay taxes, sign telephone or electricity contracts, insure their employees against accidents at work, etc. All these are administrative acts, but each of them contains information that can be identified and interpreted from a statistical point of view.

Since the end of the 1980s, the use of administrative registers for statistical purposes has given rise to a series of convictions and significant research experiences, inside and outside the Italian National Institute of Statistics, which have allowed know-how to be accumulated and have proved its technical feasibility.

Since 1994 the Italian National Institute of Statistics has been carrying out a complex project (called ASIA) for the realisation of a statistical business register, obtained through the logical and physical integration of data held in administrative and statistical sources and their treatment with statistical methodologies.

 

2. Phases and timetable of ASIA project

In 1994 ISTAT set up a working group to plan and realize the new statistical register of enterprises on the basis of the information available in the administrative field. The group completed its work in December 1994, outlining the conceptual and organizational architecture on the basis of which the ASIA project has developed in 4 phases.

  • First phase, 1995. Experiment on 3 provinces. During 1995 the first set-up of the ASIA base was tested, covering all sectors of economic activity except agriculture, forestry, fishing, public administration and public utility services (education, health, social assistance, culture, etc.), together with its application at national level. The first phase was divided into 5 subprojects: a) system of definitions and classifications of units and characters, and its correspondence with those used in the administrative and statistical sources that constitute the data input of ASIA; b) development and testing on 3 provinces of the checking and code-matching procedures that lead to the physical integration of the administrative registers used - DATIN (integrated administrative data) -; c) development and testing on 3 provinces of the linkage procedures, coherence checks, and methodologies of data imputation and missing data estimation that lead to the list of elementary statistical units of ASIA; d) preparation, training and organization of the resources needed for the first set-up of the ASIA base (sources, people, computing tools), and definition of the procedures for filing, distributing and publishing elementary data - LISTER (territorial lists) - and aggregated data - DATER (territorial statistical data) -; e) preparation and organization of the permanent territorial network (regional and provincial) needed to carry out the checking surveys and to disseminate data locally.
  • Second phase, 1996. First experimental national set-up. On the basis of the experiment on 3 provinces, in the second phase a complex system operating at national level was set up; the personnel operating in the regional offices of ISTAT and in the provincial statistical offices of the Chambers of Commerce (the territorial checking network) were trained; and the first national experimental set-up of ASIA enterprises was completed (with identification data referring to 1995 and structural data referring to 1994).
  • Third phase, 1997. Final national set-up. In 1997 the quality check of the first national set-up of the ASIA base started; DATINT'95 was produced and returned to the suppliers, and new data for 1996 were collected; the production process of the final first set-up, with the 1996 matched data and the 1995 stratification data, was carried out.
  • Fourth phase, 1998/99. Development. The fourth phase foresees, for 1998/99, the field check of the first national set-up of ASIA through the Intermediate Census. Afterwards, satellite registers will be built for particular sectors (for instance trade, craft, and tourism), for which specific stratification characters will be collected from appropriate sources.

 

3. ASIA Project: the main features

3.1. ASIA as centre of the Statistical Information System

The 1996-98 National Statistical Programme, especially in the economic area, specifies that during these three years ISTAT will develop the "progressive passage from statistics on enterprises, currently prevalent, to an integrated system of economic statistics". This means that the Statistical Information System will be structured and integrated as the complex of information available in the economic field, both collected directly by the Institute through surveys and obtained from other sources (for instance administrative registers).

A systematic approach to the economic information collected by ISTAT can ensure growth in the quantity of information "distributed" and better quality; it also guarantees greater coherence among the information gathered by different surveys on the same sub-universe of statistical units. At the same time, time, costs and statistical burden are reduced when the system provides for the integration of "external" non-statistical data.

Generally an information system is designed to let an organization (for instance an enterprise) manage the information needed for its activity. We speak of an SIS (Statistical Information System) when large databases, fed by information sources managed by different organizations, are made available solely for the study of particular phenomena. In particular an SIS supports the first phase of statistical analysis, that is, the collection of basic information and the correct use of meta-information. In the general sense an SIS is an integrated system because it contains all the information produced and managed by several statistical surveys. Each statistical survey can be considered an organizational structure that produces and manages its own database independently of the others.

The central element of an SIS is a complete and updated Register of the statistical units (not necessarily the observation units), which is both the physical connecting centre, since the various types of gathered or acquired information are connected to the individual elementary units, and the logical connecting centre, since it holds the meta-information (definitions, classifications) needed for the consistency of the whole system.

The Business Statistical Register (ASIA) was designed and realized in order to:

  • define an updated list of elementary units which is both the target frame and the sampling frame of different surveys on enterprises;
  • define statistical data on the economic units structure and on its evolution;
  • be the gravity centre of the Integrated Information System on economic statistics of ISTAT.

 

3.2. ASIA as a complex filing system

The aim of defining a statistical register of economic units, seen as the connecting centre of a complex statistical information system, implies that the data filing system should ensure:

  • completeness: it should include all the elementary units in the national territory and guarantee, for each unit, the completeness of the characters (where necessary through appropriate estimation methodologies) needed to manage the whole system;
  • reliability: it should ensure correct and constant updating of units and characters;
  • confidentiality: it should protect individual information during the various phases of data processing;
  • manageability: completeness and reliability can be secured only if recording and updating are limited to the necessary, main characters;
  • flexibility: it should support the management of particular sub-universes defined according to the known demands of different typologies of users;
  • economy: it should avoid data redundancies, duplications and inconsistencies in updating.

This means proposing a system that:

  1. defines different typologies of data for different typologies of units;
  2. separates the moment of "processing data" from the "access" on data by users;
  3. allows the management of "units partial views" and of data defined on the base of individual users' requirements.

Therefore, the System has been implemented as an integrated set of registers organized as follows:

  1. central, sectorial (or satellite) and survey registers,
  2. management and dissemination registers,
  3. national and territorial registers,
  4. separate registers for different typologies of recorded units.

The need for the first typology in the list above (schematized in Fig. 1) stems from the requirement to separate the subregisters according to each sector's need to record particular stratification characters. Three central registers have been identified: ASIA-enterprises, ASIA-agriculture and ASIA-institutions. ASIA-enterprises includes all units carrying out their main activity in the industrial and services sectors as profit organizations (units set up for profit, including co-operatives and consortia). ASIA-agriculture includes all units carrying out their main activity in the agriculture, forestry and fishing sectors. Finally, ASIA-institutions includes the whole Public Administration, public organizations that provide non-market services, and private non-profit organizations.

A distinctive element of the logical design of the ASIA system is the definition of satellite registers (SR). This typology has been defined because the information requirements of some sectors of activity (or of some typologies of unit) demand the recording of particular characters, different from those of the central registers, with the main aims of:

  • better stratifying the units in order to reduce variability within strata,
  • better classifying the units in order to ensure greater knowledge of the whole structure of a given sector.

In defining and building an SR, the unity of the filing system must be preserved through the following three essential conditions:

  1. all units of the satellite register are also included in the general or central register,
  2. the unit identification code is the same in all the register typologies in which the unit appears,
  3. the concept of information inheritance applies: the SR takes the information of interest from the central registers, since it is possible to modify them.

So an SR can be considered as an enriched partial view of the central register.
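The idea of a satellite register as an enriched partial view can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the unit codes, the activity codes and the extra character `beds` are hypothetical and do not come from the ASIA manual.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class CentralUnit:
    """One record of a central register (hypothetical, minimal layout)."""
    unit_code: str   # identification code shared by all registers
    name: str
    nace: str        # economic activity code (illustrative values only)
    employees: int


def build_satellite(central, in_sector, extra_characters):
    """Build a satellite register as an enriched partial view.

    - Only units already present in the central register are considered,
      so the satellite is always a subset of it (condition 1).
    - Rows are keyed by the same unit code (condition 2).
    - Central characters are copied into each row (information inheritance,
      condition 3) and then enriched with sector-specific characters.
    """
    satellite = {}
    for code, unit in central.items():
        if in_sector(unit):
            row = asdict(unit)                          # inherited characters
            row.update(extra_characters.get(code, {}))  # sector-specific ones
            satellite[code] = row
    return satellite


central = {
    "U001": CentralUnit("U001", "Alfa Hotel", "55.11", 12),
    "U002": CentralUnit("U002", "Beta Steel", "27.10", 240),
}
tourism = build_satellite(
    central,
    in_sector=lambda u: u.nace.startswith("55"),
    extra_characters={"U001": {"beds": 40}},
)
print(tourism)
```

Because the satellite is derived from the central register rather than maintained separately, the first two conditions hold by construction.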

The need to distinguish between management registers and registers for users arises in order to guarantee that users have verified information, checked as of a specific date, whereas, for the purposes of ASIA "management", the management registers may contain information at various levels of updating and correctness.

The complexity of a system like ASIA, which includes several million units, requires the involvement of many organizational structures, particularly those closest to the territory, which can better and more directly verify the units in their own territory and at the same time can ensure the dissemination of ASIA products, "personalizing" them to particular demands. Accordingly, the organizational structure of ASIA has been designed on 3 levels: national (central ISTAT), regional (regional offices of ISTAT) and provincial (statistical offices of the Chambers of Commerce); each level is assigned the management and dissemination activities for a set of units. The typology named "national and territorial registers" therefore defines a division of ASIA that identifies the sets of units that can be managed only at central level (large enterprises, enterprises operating nationwide, groups of enterprises, etc.) or at local level.

Finally, in order to avoid redundancies, it is necessary to identify different register structures for different typologies of units. So, logically, the local units register is distinct from the enterprises register (the latter will not contain, for instance, the character "address", which will be recorded in the corresponding local unit defined as the enterprise's office). Logical links between different typologies of units (local units referring to the same enterprise, enterprises belonging to the same group, etc.) are ensured by physical references to unit codes.


3.3. ASIA as integration of administrative sources

Every administrative body has its own function to collect data and manage the corresponding records, under specific legislation and rules which govern relations between various individuals and between them and the public administration. Thus, each source makes use of definitions, classifications and rules on entry and cessation that are peculiar to itself and depend on the functions of the authority concerned. The administrative body defines, classifies, collects and records information on economic subjects and their characteristics that, in the strict sense of the word, does not have statistical validity. In other words, using administrative data poses statisticians a problem that is not easy to solve: the inconsistency of data. In a survey, or inside a system for the collection of statistical data, consistency is a problem addressed ex ante, since it is strongly linked to the process of microdata collection and macrodata production. When we want to use data stored in non-statistical (administrative) databases, over whose production process statisticians have no control, the problem of consistency arises in a different context and can be resolved only ex post.

So the main problem that arises in using administrative sources for statistical purposes is to identify the correspondences between the statistical concepts and the administrative rules and laws through which those sources observe the universe, or population, of reference. It is therefore necessary to process the administrative sources in order to align them with the statistical concepts and definitions. This is possible if, on the one hand, we have a deep knowledge of the sources to be used and, on the other hand, suitable statistical methodologies are available.

The conceptual elements mentioned above show how treating the use of administrative data (and their possible integration) as a purely computing problem, tied to the handling of large databases, can lead to serious errors. Only the appropriate use of statistical methodologies can ensure consistent results.

Referring to a defined statistical universe, the typologies of errors generated in the use of an administrative source for statistical purposes, can be summarised as follows:

E1 - under-recording error:

a) missing recording of legal subjects due to evasion, delays, etc.;

b) non-recording of legal subjects not obliged to register.

E2 - over-recording error:

a) registration of non-active legal subjects due to duplications or delays in recording cessations;

b) registration of legal subjects without any feature of an enterprise.

E3 - wrong assignment of characters:

a) wrong recording due to delays in acquiring variations or to errors in declarations, in recording, in checking;

b) wrong recording due to different definitions and classifications.

E4 - missing assignment of characters:

a) partial or total lack of attribution of a character.

The integration process is useful when the input sources considered do not ensure completeness of units and of the units' characters; it then reduces errors of types E1 and E4. The process is less useful for reducing over-recording errors and wrong attribution of characters. In fact, using more sources can increase type E2 errors; and unless one source is clearly and considerably better than the others, the additional information complicates the imputation of statistical characters. In this case the likely gain in information depends on the hypotheses about the structure and quality of the data in the available sources and on the validity of the statistical hypotheses on which the processes of conceptual and physical integration and optimisation are based.
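The opposite effect of integration on E1 and E2 errors can be illustrated with a small Python sketch based on sets of unit identifiers. All identifiers and the "true" universe of active enterprises below are invented for illustration; in practice the true universe is of course unknown.

```python
# Hypothetical true universe of active enterprises (unknown in practice).
active = {"A", "B", "C", "D", "E"}

# Hypothetical contents of two administrative sources.
sources = {
    "TR": {"A", "B", "C", "D", "X"},   # X: recorded but not an active enterprise
    "CCIAA": {"A", "B", "E", "Y"},     # Y: e.g. a cessation not yet recorded
}


def coverage_errors(recorded, universe):
    """Return (under-recording, over-recording) counts for a set of records."""
    under = universe - recorded   # E1: active units missing from the records
    over = recorded - universe    # E2: recorded units that are not active
    return len(under), len(over)


# Integration modelled here simply as the union of the sources.
integrated = set().union(*sources.values())

for name, recorded in sources.items():
    print(name, coverage_errors(recorded, active))
print("integrated", coverage_errors(integrated, active))
```

In this toy example the union removes all under-recording (E1 falls to zero) while over-recording accumulates (E2 rises from one unit per source to two), which is why integration alone does not address E2.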

Turning to the formal aspect of the integration process, let $x_i$ be the real value of attribute X for the i-th unit, and let $x_{i1}, \dots, x_{ij}, \dots, x_{im}$ be the values recorded in the m available sources. The relation between the available values and the real one can be described as follows:

$x_{ij} = g(x_i, e_j, e_{ij})$

where $e_j$ is the error due to the bias of the j-th source (the structural errors of type "b" described above), and $e_{ij}$ is the random error (errors of type "a").

When the j-th source is thoroughly known, it is possible to identify rules that standardise (or harmonise, or normalise) the units and variables of the input source into statistical units and variables. The standardisation function is thus defined as the mapping

$f_s : X_j \Rightarrow X$, which changes the values $x_j \in X_j$ into values $x_i \in X$

In other words, a standardisation rule converts administrative concepts and classifications into statistical ones. Such rules, generally deterministic, fall into three types:

  • coding rules: which convert codes (e.g. economic activity, legal form, and location) into statistical classifications (NACE, NUTS, etc.);
  • link rules: by which the different records corresponding to legal or administrative units in one source can be combined to define one statistical unit (enterprise or local unit);
  • conversion rules: which derive statistical variables from administrative characters.
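The three types of rule can be sketched together in Python. The record layout, the administrative activity codes, the code table and the choice of summing employees are all assumptions made for illustration; they are not the rules actually used in ASIA.

```python
from collections import defaultdict

# Hypothetical administrative records: several positions may belong to the
# same legal subject, identified by its fiscal code.
records = [
    {"fiscal_code": "FC1", "position": "P1", "admin_activity": "T-12", "employees": 4},
    {"fiscal_code": "FC1", "position": "P2", "admin_activity": "T-12", "employees": 6},
    {"fiscal_code": "FC2", "position": "P3", "admin_activity": "M-07", "employees": 1},
]

# Coding rule: administrative activity code -> statistical classification.
# Both sides of this table are illustrative values.
ACTIVITY_TO_NACE = {"T-12": "52.11", "M-07": "28.52"}


def standardise(records):
    """Turn administrative records into enterprise-level statistical records."""
    # Link rule: records sharing a fiscal code define one enterprise.
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["fiscal_code"]].append(rec)

    enterprises = {}
    for fiscal_code, recs in grouped.items():
        enterprises[fiscal_code] = {
            # Coding rule (simplification: the first record's code is used).
            "nace": ACTIVITY_TO_NACE[recs[0]["admin_activity"]],
            # Conversion rule: a statistical variable derived from
            # administrative characters (here, a simple sum).
            "employees": sum(r["employees"] for r in recs),
            "n_positions": len(recs),
        }
    return enterprises


print(standardise(records))
```

All three rules here are deterministic, as the text indicates is generally the case: the same input records always yield the same statistical units.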

After the standardisation process the error component of the model would be reduced to the random error $e_{ij}$; the sources could then be treated as independent, unbiased random variables of equal or at least constant quality, and procedures suited to experimental frameworks, such as the theory of repeated experiments, could be adopted. But the situation is more complicated. The systematic errors in each source, produced by its administrative and legal functions, are the result of two separate elements: the first, tied to the laws, classifications and definitions specific to the source, can be identified and easily standardised; the second, tied to avoidance and evasion phenomena, cannot be identified and is mixed with the random error. The model therefore changes and becomes more complex, as follows:

$x_{ij} = g(x_i, d_j, g_i, e_{ij})$

where $e_j = h(d_j, g_i)$ is the result of a random component and of an unknown systematic component. The bias that weighs on the input values of administrative sources is hard to quantify, but it is well known in the literature. A clear example is the tendency of enterprises in some trade sectors to have themselves classified as manufacturing enterprises, for tax relief or access to credit. Equally well known is the tendency to enrol in the social security register persons who are not in subordinate employment, in order to secure social security coverage for them.

This new structure of the input data discourages the use of statistics based on linear functions of the available $x_{ij}$ (e.g. the mean), in order to avoid biased estimates of $x_i$.

The statistical methodologies adopted for imputing characters in the ASIA register are based on the concept of "choice among alternative values" rather than "combination of the available values" when several values of an attribute exist for the same unit. The problem is to identify the value $x_{ij}$ with the minimum error $h_{ij} = f(g_i, e_{ij})$. The estimate $x_i$ of the real value taken by variable X for unit i is obtained according to the following rule:

$x_i = x_{ij} : h_{ij} < h_{ik} \quad \forall\, k \neq j$, with $k = 1, \dots, m$.

When there is a "reliable witness" (for instance census data) free of systematic errors, the problem of identifying, among the available values, the one with minimum $e_{ij}$ is easily solved. When this is not possible (as in the first implementation of the ASIA register), it is necessary to use probabilistic methodologies based on "ad hoc" indicators of the quality of the sources used, or to use instrumental variables in order to estimate the plausibility of the competing hypotheses. The input and character estimation procedures carried out in the ASIA project therefore used the above types of methodology, mostly based on non-parametric models.
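A minimal Python sketch of "choice among alternative values" follows. It assumes that an error indicator is already available for each source and unit; how such indicators are built is the substance of the methodology and is not reproduced here. The source values and indicator values are invented for illustration.

```python
def choose_value(values, error_indicator):
    """Pick the value reported by the source with the smallest error indicator.

    values:          {source: recorded value, or None if missing}
    error_indicator: {source: indicator for this unit, lower is better}
                     (assumed to be given; hypothetical numbers below)
    Returns (source, value), or (None, None) if no source reports a value.
    """
    available = {s: v for s, v in values.items() if v is not None}
    if not available:
        return None, None
    best = min(available, key=lambda s: error_indicator[s])
    return best, available[best]


# Number of employees of one unit as recorded by three sources.
employees = {"INPS": 12, "CCIAA": 20, "INAIL": None}
indicator = {"INPS": 0.05, "CCIAA": 0.30, "INAIL": 0.15}

source, value = choose_value(employees, indicator)
print(source, value)

# For contrast: combining the available values through the mean.
reported = [v for v in employees.values() if v is not None]
print(sum(reported) / len(reported))
```

The rule keeps one recorded value unchanged (here the INPS figure of 12), whereas the mean (16) is a value that no source reported and that carries over whatever unknown systematic error the sources contain.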

 

4. ASIA registers: general aspects

4.1. Units and recorded characters

The ASIA statistical units are the economically active enterprises and institutions and their local units. An enterprise is the organization of an economic activity carried out professionally for the production of goods or the supply of market services. An institution is a public or private organization carrying out a non-profit activity. Enterprises and institutions are economically active if they carry out a real productive activity, in the national accounts sense. A local unit is the place, variously denominated (factory, laboratory, shop, workshop, restaurant, hotel, office, agency, store, professional office, school, hospital, etc.), where goods are delivered or arranged, or where services are supplied. A local unit is topographically identified in a single place (province, municipality, and census section) in which one or more persons work, possibly part-time, on behalf of the same enterprise. In addition, ASIA provides for the recording of other typologies of units: the group of enterprises, a set of legally independent enterprises (sole proprietorships, partnerships, companies) whose activity is under the control of a single "leader" (individual person, public institution, enterprise); and the unit of economic activity, a unit that groups together, within one enterprise, all the parts contributing to the exercise of an economic activity at "class level" (4 digits) of the NACE Rev. 1 nomenclature. It is an entity corresponding to an information system that supplies or estimates, for each unit of economic activity, at least the value of production, personnel costs, the operating result, investments and the amount of labour input.

The characters that describe statistical units are grouped into the following typologies:

  • identification (enterprise name, address, telephone and legal status);
  • relation (between register units, with input sources, with surveys, with registers);
  • stratification (location, economic activity, employees, turnover class);
  • demographic (state of activity, date of starting activity, date of ending activity, structural transformations).

 

4.2. Sources utilized

The input sources for the setting-up and updating of ASIA belong to 3 levels of suppliers.

The first level is represented by the set of large national administrative or tax-collection registers managed by different institutions. It comprises the primary input sources of the ASIA project:

  • Tax Register - TR -, managed by the Ministry of Finance (9,500,000 records), which gathers information on all natural and legal persons that file returns for the payment of direct or indirect taxes;
  • Register of Enterprises and Local Units - CCIAA -, managed by the Chambers of Commerce, Industry, Craft and Agriculture (5,800,000 records), which gathers the declarations of subjects that want to undertake any productive economic activity (excluding free professionals);
  • Social Security Register - INPS -, which records enterprises with employees for whom the payment of social security contributions is compulsory (1,700,000 records);
  • Work Accidents Insurance Register - INAIL -, which records enterprises employing people for whom insurance against accidents at work is compulsory (3,200,000 records);
  • Register of electricity users of the Electric Power Board - ENEL -, excluding "domestic" users (4,200,000 records).

The second level of suppliers consists of public and private institutions that manage sub-registers covering specific, well-delimited sectors, which are therefore easily manageable. Examples are the Central Bank, UIC (Italian Exchange Office), CONSOB (supervisory commission for listed companies) and ISVAP (supervisory body for insurance companies) for the financial sectors, the Ministry of Industry and other public and private sources for large-scale retail trade, the Ministry of Transport for goods transport authorizations, ENIT (National Tourism Board) for travel agencies and hotels, etc.

The third level of suppliers consists of all the statistical surveys that ISTAT carries out on enterprises (system of enterprise accounts, survey on the gross product of small enterprises, surveys on the services sector, sample survey on domestic trade, short-term surveys, Intrastat, etc.) and that use ASIA as their reference base.

With reference to the primary sources only, Table 1 shows the correspondence between the units recorded in the different administrative files. For each source the columns describe: 1) the subjects required to register, 2) the typology of observed unit, 3) the statistical units that can be derived from those observed.

Table 1 - Synoptic table of units recorded in the ASIA input sources

SOURCES | SUBJECTS REQUIRED TO REGISTER           | OBSERVED UNITS        | STATISTICAL UNITS DERIVABLE
--------|-----------------------------------------|-----------------------|--------------------------------------------
CCIAA   | Businessmen                             | Local units           | Enterprise
        |                                         |                       | Local unit
TR      | VAT payer                               | Natural persons       | Enterprise
        |                                         | Legal persons         | Enterprise
        | Legal persons not liable for VAT payment | Legal persons         | Enterprise or institution
INPS    | Employers                               | Contribution number   | Enterprise
        |                                         |                       | Provincial local unit
INAIL   | Employers                               | Insurance number      | Enterprise
        |                                         |                       | Local unit, or group, or part of local units
ENEL    | Customers                               | Non-domestic customer | Local unit

The observed units recorded by the Chambers of Commerce are the local units. A businessman (anyone who wants to begin a business activity) is required to file a declaration of start of activity with the Chamber of Commerce of the province in which he wishes to operate. If the businessman trades in more than one province, he has to make separate declarations to the respective Chambers of Commerce. Using the fiscal code (unique to each enterprise), it is possible to integrate the different items of information on the various local units and derive the "enterprise" as a unit.

The Tax Register records all natural and legal persons who operate in national territory and are required to comply with fiscal legislation. From this register it is possible to select four separate sub-sets of units:

1. Natural persons required to make VAT returns,

2. Legal persons required to make VAT returns,

3. Natural persons not required to make VAT returns,

4. Legal persons not required to make VAT returns.

Sub-sets 1. and 2. represent "individual concerns" and "companies". The third sub-set covers individuals whose activity is not liable for VAT. The final sub-set comprises "Institutions" (including specific types of enterprises such as co-operatives and consortia). This final sub-set requires further specification of the units, so that co-operatives and private consortia can also be included among the enterprises.
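The partition of the Tax Register into four sub-sets can be written as a small Python function. The argument names and the short descriptions are illustrative; they simply restate the list above and are not an actual ISTAT procedure.

```python
# Short descriptions restating the four sub-sets listed in the text.
SUBSET_DESCRIPTION = {
    1: "individual concerns",
    2: "companies",
    3: "individuals whose activity is not liable for VAT",
    4: "institutions (co-operatives and consortia to be moved to enterprises)",
}


def tax_register_subset(person_type, vat_returns):
    """Return the sub-set number (1-4) of a Tax Register unit.

    person_type: "natural" or "legal" (hypothetical coding of the record)
    vat_returns: True if the unit is required to make VAT returns
    """
    if person_type not in ("natural", "legal"):
        raise ValueError(f"unknown person type: {person_type!r}")
    if vat_returns:
        return 1 if person_type == "natural" else 2
    return 3 if person_type == "natural" else 4


for person_type in ("natural", "legal"):
    for vat_returns in (True, False):
        n = tax_register_subset(person_type, vat_returns)
        print(person_type, vat_returns, n, SUBSET_DESCRIPTION[n])
```

Units falling in sub-set 4 would still need the further specification mentioned in the text before they can be assigned to enterprises or institutions.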

The observed units recorded by INPS (INAIL) are "contribution numbers" (insurance numbers). An employer may get one or more contribution numbers; this may be for territorial reasons but may also relate to differing occupational qualifications of the employed personnel to be insured. It is therefore not possible to obtain information at local-unit level but it is possible (using the fiscal code) to define the "enterprise" unit.

ENEL records units defined as "non-domestic customers", which identify the physical point where the electricity customer is connected and therefore correspond to local units. But the Electric Power Board does not cover the entire national territory on its own and the "enterprise" unit cannot therefore be derived.

Table 2 outlines the coverage of the ASIA input sources for the principal sectors of economic activity. The table shows that the Tax Register covers the largest universe among the various sources.

Table 2 - Coverage of the ASIA input sources by macro-sector of activity

SOURCES      | Agriculture | Industry | Craft | Trade | Services | Free Profession | Institution
-------------|-------------|----------|-------|-------|----------|-----------------|------------
Tax Register | Full        | Full     | Full  | Full  | Full     | Full            | Full
Ch. of Comm. | Full        | Full     | Full  | Full  | Part     | None            | Part
INPS         | Part        | Part     | Full  | Full  | Part     | Part            | Part
INAIL        | None        | Part     | Part  | Part  | Part     | Part            | Part
ENEL         | Part        | Part     | Part  | Part  | Part     | Part            | Part

 

4.3. ASIA products

The ASIA project provides for the realization of 3 products:

- DAT. INT: register of integrated data of some administrative units;

- DA.TER: territorial register of statistical data;

- LIS.TER: territorial register of statistical units.

The problem of statistical confidentiality is treated differently for the 3 products mentioned above:

DAT. INT: each institution is given back its own register, containing the same data it supplied to ISTAT, normalized (in terms of statistical classifications), verified for quality and integrated with information from the other input sources (note: the return of normalized and integrated administrative data is a condition for lowering ASIA's data acquisition costs: the costs borne by the suppliers would in fact be offset by those borne by ASIA for normalization and quality verification). For this reason, individual institutions are not supplied with information on units not present in their own registers.

DA.TER.: consists of statistical tables of stocks and flows, produced on paper or magnetic media, showing the distribution of the statistical units listed in ASIA according to the main stratification characters: employees, turnover, economic activity, legal form. Information will be available at region, province and municipality level, as in the past, but also at census section level, since each unit will be georeferenced on the territory according to its street address.

DA.TER. will be distributed to users under the same criteria adopted for disseminating the results of the 1991 census of industry and services.

LIS.TER.: consists of lists of elementary units containing identification characters (business name, address, fiscal code, etc.) and statistical stratification characters. The lists will be covered by statistical confidentiality and will be available to ISTAT offices and to the statistical offices of SISTAN institutions, for the exclusive purpose of carrying out statistical surveys. In particular, any use of these data for administrative or fiscal checks is excluded.

 

5. The ASIA register production process

5.1. The basic guide lines

The production process (and the methodological rules applied) rests on some fundamental choices tied to the main conceptual and definitional aspects developed in the planning of ASIA.

The legal universe of reference - The "statistical" concept of enterprise is tied to the existence of one or more legal units (an enterprise does not exist unless it refers to at least one legal unit). A unit is considered legal when it is recognized by the Government and has certain duties towards the State administration. The starting hypothesis is that the first act a legal unit carries out for its activity is to obtain a Fiscal Code from the tax register, and that a legal unit cannot exist without a fiscal code. The operative rule is therefore:

the tax register is the basic universe of legal units.

The statistical universe of reference - the enterprises - Not all units of the legal universe of reference are statistically relevant. A legal unit may not carry out a productive activity (or may not yet have started, or may already have ceased), in which case it is not a real active enterprise of interest for ASIA. Since in the first implementation of ASIA there is no "direct information" (except for large units), whether a legal unit is active can only be inferred from the joint analysis of the input sources. The operative rule is therefore:

a legal unit can be included in the ASIA universe

if at least one indication of real activity can be deduced from the sources used.

The imputation of characters - The identification characters (corporate name, address, etc.) and stratification characters (economic activity, employees, etc.) of enterprises and local units are imputed by statistical and probabilistic methodologies which "select" among the different values found in the various sources. The general principle is:

the sources have "equal dignity" in the imputation of the value of a character,

the choice of characters is made with probabilistic criteria,

also using statistical indicators of information quality.

In this way the methodologies used do not introduce "reasoned" preferences: no source is privileged "a priori" over another.

 

5.2. The phases of the ASIA production process

The main production process of ASIA can be divided into seven principal macrophases.

 

The registers of the administrative bodies taking part in the ASIA project are first standardised and normalised (macroph. 1), so that the units of the several registers can be compared more easily with one another and with the statistical concepts.

In particular, in this phase the characters for which the input sources use their own classifications (mainly economic activity, localization and legal status) are reclassified. Addresses are checked and standardized and, where possible, the fiscal code is checked. Table 3 shows the results of address normalization: the procedure reduces the various address acronyms to a standard acronym and assigns a "road code" and a "census area code". The lower percentage of census areas assigned (compared with the number of addresses normalized) is caused by the lack of the civic number in the recorded addresses.

Tab. 3 - Normalization of the addresses

                             TR        CCIAA     ENEL      INAIL     INPS
Records (thousands)          9,946.4   5,798.1   4,297.7   3,240.0   1,642.9
% of normalized addresses    84.8      90.3      82.6      87.8      86.7
% of assigned census area    74.5      78.2      69.0      76.7      74.2
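The address normalization summarised in Tab. 3 can be illustrated with a minimal Python sketch. It is not the ASIA procedure: the acronym table, the lookup dictionaries `road_codes` and `census_areas`, and the rule for recognising a civic number are all assumptions made for illustration. It only shows why an address without a civic number can receive a road code but not a census area code.

```python
import re

# Hypothetical acronym table; the actual ASIA standard acronyms are not given in the text.
ACRONYMS = {"V.": "VIA", "V.LE": "VIALE", "P.ZZA": "PIAZZA", "C.SO": "CORSO"}

def normalize_address(raw, road_codes, census_areas):
    """Standardize an address and attach road and census area codes when possible."""
    tokens = [ACRONYMS.get(t, t) for t in raw.upper().replace(",", " ").split()]
    # Assumption: the civic number, if recorded, is a trailing token such as "12" or "12A".
    has_civic = bool(tokens) and re.fullmatch(r"\d+[A-Z]?", tokens[-1]) is not None
    civic = tokens[-1] if has_civic else None
    road = " ".join(tokens[:-1] if has_civic else tokens)
    road_code = road_codes.get(road)
    # Without a civic number the census area cannot be assigned.
    census_area = census_areas.get((road_code, civic)) if road_code and civic else None
    return {"road": road, "civic": civic, "road_code": road_code, "census_area": census_area}

# Hypothetical lookup tables and example addresses.
roads = {"VIA ROMA": "R001"}
areas = {("R001", "12"): "C07"}
with_civic = normalize_address("v. Roma, 12", roads, areas)
without_civic = normalize_address("V. Roma", roads, areas)
```

In this sketch both example addresses are normalized and receive a road code, but only the first one can be assigned a census area.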

Tab. 4 shows, instead, the results of the checks carried out on fiscal codes. For all sources except ENEL, the coverage of this character is about 95%.

Tab. 4 - Check and revision of Fiscal Code

                         TR      CCIAA   ENEL    INAIL   INPS
% missing fiscal code    0.00    2.20    13.03   0.91    0.25
% wrong fiscal code      0.00    4.95    15.4    4.20    4.21
% revised fiscal code    0.00    0.10    7.69    1.13    1.65

Afterwards each unit of the ASIA register is obtained through a physical integration of the units of each register. The linking process makes it possible to identify the statistical units and is intended to avoid the information redundancy caused by multiple references to the same unit within one input archive or across several input archives.

The linking process is carried out in two steps (macroph. 2 and 3 of figure 2).

a) Construction of the groups of records through intra-archive link. A group is the aggregate of records in one input source referring to the same entity, identified by the fiscal code or, when the fiscal code is absent, by the unit name. The positions in each source that refer to the same legal unit are aggregated.

b) Construction of the clusters of records through inter-archive link. Once the groups have been reconstructed inside each input source, a pairwise link is made between the Tax Register, which is the "pivot register", and each of the other sources. The main link key is the fiscal code, or part of it. Only in some cases, in particular to ensure a better matching of INPS data, is a link on identifying data (name and address) used. The inter-archive link "adds" relevant information (localization, economic activity, size and activity status) to the information acquired from the Tax Register. The final result of this phase is the "cluster of records". A cluster is the aggregate of records in all input sources referring to the same legal unit, identified by the fiscal code.
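The two linking steps can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the ASIA implementation: records are plain dictionaries with hypothetical `fiscal_code` and `name` fields, the link uses the full fiscal code only, and the name-and-address link used for some INPS records is left out.

```python
from collections import defaultdict

def build_groups(records):
    """Intra-archive link: aggregate the records of one source by fiscal code,
    falling back to the unit name when the fiscal code is missing."""
    groups = defaultdict(list)
    for rec in records:
        key = rec.get("fiscal_code") or ("NAME", rec["name"].strip().upper())
        groups[key].append(rec)
    return dict(groups)

def build_clusters(tax_register, other_sources):
    """Inter-archive link: attach the groups of each source to the pivot (Tax Register)."""
    clusters = {key: {"TR": recs} for key, recs in build_groups(tax_register).items()}
    unmatched = defaultdict(list)
    for source, records in other_sources.items():
        for key, recs in build_groups(records).items():
            if key in clusters:
                clusters[key][source] = recs
            else:
                unmatched[source].append(key)  # group with no counterpart in the pivot
    return clusters, dict(unmatched)

# Hypothetical example data.
tr = [{"fiscal_code": "A1", "name": "Rossi"}, {"fiscal_code": "B2", "name": "Bianchi"}]
others = {"INPS": [{"fiscal_code": "A1", "name": "Rossi srl"},
                   {"fiscal_code": "A1", "name": "ROSSI"},
                   {"fiscal_code": None, "name": "Verdi"}]}
clusters, unmatched = build_clusters(tr, others)
```

Here the two INPS positions with fiscal code A1 form one group and join the Tax Register cluster A1, while the INPS record without a fiscal code remains unmatched.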

Tab. 5 shows the results obtained by the link phases.

Tab. 5 - Results of the inter-archives link

Sources   Records      Groups       Matched groups (*)      Matched records (*)
                       formed       Abs. value     %        Abs. value     %
CCIAA     5,798,105    4,886,217    4,554,440      93.21    5,427,839      93.61
ENEL      4,297,738    3,025,897    2,215,458      73.22    3,273,080      76.16
INAIL     3,299,906    2,625,057    2,481,182      94.52    3,141,359      95.20
INPS      1,642,933    1,497,706    1,446,143      96.56    1,587,619      96.63

The unmatched records of each source therefore amount to 4-6%, except for the ENEL source. This percentage falls further (to an average of about 2-3%) when the unmatched records are considered relative to the records of active units.

In total, about 6,000,000 clusters were created by the links. The analysis of clusters (macroph. 4) is carried out before the imputation of the main statistical characters. In this phase, all clusters are excluded for which the combined sources give reliable information either that the business is not active (before 1995) or that its activity falls outside the field of observation (agriculture, institutions).

The clusters included are thus reduced to 4,973,593 and represent the aggregate to which the probabilistic procedures for the imputation (estimation) of statistical characters are applied.

Tab. 6 shows the structure of clusters by number of sources.

 

Tab. 6 - The structure of clusters

Number of                           Sources                                  N of clusters
sources      TR          CCIAA       ENEL        INPS        INAIL       Abs. value   %
5 sources    X           X           X           X           X             640,550    12.88
4 sources                                                                1,030,401    20.72
             X           X           X           X                          59,765     1.20
             X           X           X                       X             475,325     9.56
             X           X                       X           X             430,462     8.65
             X                       X           X           X              64,849     1.30
3 sources                                                                1,232,391    24.78
             X           X           X                                     457,210     9.19
             X           X                       X                          66,517     1.34
             X           X                                   X             671,120    13.49
             X                       X           X                          11,827     0.24
             X                       X                       X              25,717     0.52
2 sources                                                                1,399,264    28.13
             X                                   X           X              63,205     1.27
             X           X                                                 981,992    19.74
             X                       X                                     276,490     5.56
             X                                   X                          23,649     0.48
             X                                               X              53,928     1.08
1 source                                                                   670,987    13.49
             X                                                             670,987    13.49
Total        4,973,593   3,782,941   2,021,733   1,360,824   2,425,156   4,973,593   100.00

At the end of the linking process it is possible to identify the statistical units, active enterprises and local units, and their statistical characteristics (macroph. 5), obtained as a synthesis of the information taken from the input sources by means of probabilistic methodologies. This phase also serves to further define the universe of the register, making a new selection (in particular between active and not active enterprises) among the clusters for which the statistical attributes are defined.

The attributes of enterprises and local units (activity status, legal form, economic activity code, employment, etc.) are thus defined, and are later checked through a suitable compatibility plan that allows both the estimation of missing data and the correction of incorrect data (macroph. 6). A general evaluation of the results of the process can be obtained only through statistical surveys carried out in the field: probabilistic samples, area samples, and partial surveys on particular sub-universes (macroph. 7).

The methodologies adopted and the results obtained in the last three macrophases are described in the following paragraphs.

 

6. The adopted statistical methodologies for the imputation of the ASIA characters

6.1. The attribution of the "activity status"

The statistical unit of interest for the business register is the enterprise. Such a unit is defined as active if in a given time period it carries out a real economic activity (using production factors such as labour). The real activity must be proved by economic evidence of activity; units that are legally registered but not economically active, such as those that have interrupted their activity without legally declaring its cessation, do not fall within this definition.

The proposed methodology is to identify objective signals of activity (e.g. electricity consumption, the realisation of value added) which can be combined by statistical methods in order to classify enterprises as active or not. Basically it is assumed that: i) a unit is active when signals of activity are evident; ii) in case of uncertainty an inclusion error (considering a unit active when it is not) is preferred to an exclusion error (considering a unit not active when it is).

The first step of the logical process consists in finding these signals of actual activity among the characters recorded in the administrative files. Choosing the right information on activity from the raw input data is not easy. Sometimes direct information on the activity status (available only in some sources) does not have the statistical meaning we are looking for; at other times the information coming from the sources is not coherent (both within and between sources) and a problem of choice arises. The information on the activity of a unit that makes up what we call the Indicator of actual activity is the following:

  • for ENEL, the presence or absence of electricity consumption in the year;
  • for CCIAA, the payment (or not) of an annual tax;
  • for INPS, the presence or absence of employees during the year;
  • for INAIL, the total amount of declared salaries paid in the year;
  • for TR, the yearly amount of turnover.

These signals are combined by means of a logit model. The logit model is based on the hypothesis of a causal relationship between a given choice (in our study a binary one) and a set of covariates x (both quantitative and categorical) considered as influencing that choice.

For a given unit i (an enterprise), i = 1, ..., N, let $x_i$ be the vector of explanatory variables (all assumed to be categorical) and $y_i$ the response variable such that:

$y_i = 1$ if the event A = "the unit is economically active" occurs,

$y_i = 0$ otherwise.

For a given set of explanatory variables, let $\pi_i$ be the probability that event A occurs and $(1 - \pi_i)$ the probability that it does not. The model describing the behaviour of the outcome can be written as:

$P(y = 1 \mid x) = F(\alpha + \beta x)$

If F is chosen to be the logistic distribution function, the following probabilities are obtained:

$P(y = 1 \mid x) = \dfrac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}} = \pi(x)$

$P(y = 0 \mid x) = 1 - \pi(x)$

which through the logit transformation become:

$\mathrm{logit}(\pi(x)) = \ln\left[\dfrac{\pi(x)}{1 - \pi(x)}\right] = \ln\left[e^{\alpha + \beta x}\right] = \alpha + \beta x$

This transformation gives the model some of the desirable properties of linear regression models.

Our interest then lies in the link between the forecast of $P(y_i = 1)$ and that of $y_i = 1$, that is, in establishing how large the probability $\pi(x)$ has to be before one can be confident that $y_i = 1$. The estimated probabilities of activity are equal for units with the same configuration of activity signals. Naturally, the more zeros (absent signals) the indicator contains, the lower the probability. Combinations with missing sources are also taken into account. The logit model thus gives each unit a probability of being active. This probability has to be linked to the activity status of the enterprise; the results can be summarised in a classification table comparing the observed cases of the two events with the predicted ones.

As is known, the classification of the values predicted by the model into events and non-events is affected mainly by the true distribution of the observed values. A very important decision for correct classification is the definition of a cut-off point, or confidence threshold, c against which the estimated probabilities $\pi_i$ are compared according to the following rule:

predict $y_i = 1$ if $\pi_i > c$,

predict $y_i = 0$ if $\pi_i \le c$.

Because this decision is crucial, it is suggested that the values of some tests (accuracy, specificity and sensitivity) be taken into account. A good approach could be to choose the threshold reflecting the proportion of events found in the sample, or to choose the value minimising the fraction of false negatives, that is, the cases predicted as not active but observed as active (in line with the starting hypothesis).
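The logistic probability and the cut-off rule can be written as a short Python sketch. The coefficients, the cut-off and the signal vector below are invented for illustration; the paper does not report the estimated values. The signal vector is assumed to hold one 0/1 entry per source (ENEL, CCIAA, INPS, INAIL, TR).

```python
import math

def activity_probability(alpha, beta, x):
    """pi(x) = exp(alpha + beta.x) / (1 + exp(alpha + beta.x))."""
    eta = alpha + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-eta))

def logit(p):
    """Inverse transformation: ln(p / (1 - p)) returns the linear predictor."""
    return math.log(p / (1.0 - p))

def classify(p, cutoff):
    """Predict y = 1 (active) if p > cutoff, otherwise y = 0."""
    return 1 if p > cutoff else 0

# Hypothetical coefficients and activity signals (not estimates from the paper).
alpha = -2.0
beta = [1.0, 0.5, 1.5, 1.0, 2.0]
signals = [1, 0, 1, 0, 1]          # ENEL, CCIAA, INPS, INAIL, TR
p_active = activity_probability(alpha, beta, signals)
predicted = classify(p_active, cutoff=0.5)
```

Since the starting hypothesis prefers inclusion errors to exclusion errors, a lower cut-off than the 0.5 used here might be chosen in practice; that choice is left open in this sketch.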

The performance of the model was further tested on an "ad hoc" survey conducted in the Province of Cagliari. The sample was chosen to test some of the procedures applied in building the ASIA business register, and the enterprises selected are considered active on other grounds.

                          Predicted
                          Active    Not Active    Total
Observed   Exists         2,203     53            2,256
           Not Exists     36        9             45
           Total          2,239     62            2,301

The table shows that 96% of all enterprises are correctly classified and that 98% of existing units are classified as active; however, 36 out of 45 non-existing units are not recognised as not active by the model and are falsely classified as active: they no longer exist at the end of the year but show some signals of activity during the year. This is probably because an enterprise can show an indicator of activity during the year even if it ceases at some point in that year. In general there is a systematic under-recording of cessations of activity; moreover, as the estimates show, the administrative information of some registers is not really discriminating.
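The percentages quoted for the Cagliari test follow directly from the four cells of the classification table. The short Python check below recomputes them; the function and key names are illustrative only.

```python
def classification_rates(tp, fn, fp, tn):
    """Accuracy, sensitivity and specificity from a 2x2 classification table."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,     # share of units correctly classified
        "sensitivity": tp / (tp + fn),     # existing units predicted as active
        "specificity": tn / (fp + tn),     # non-existing units predicted as not active
    }

# Cells of the Cagliari table: observed (rows) by predicted (columns).
rates = classification_rates(tp=2203, fn=53, fp=36, tn=9)
```

Accuracy is 2,212/2,301 (about 96%) and sensitivity is 2,203/2,256 (about 98%), while specificity is only 9/45 (20%), which reflects the 36 ceased units classified as active.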

In conclusion, the probabilities can be used as good indicators of a more or less strong situation of real activity during a reference period. For classification at precise points in time further information is necessary, and current procedures make use of the dates of registration and cancellation. The following table presents two configurations of the distribution of enterprises.

 

Tab. 7 - Results of activity status imputation

              State of activity at December 1995     State of activity during 1995
              N. of enterprises     %                 N. of enterprises     %
Not Active    1,155,568             23.2              1,073,393             21.6
Active        3,818,025             76.8              3,900,200             78.4
Total         4,973,593                               4,973,593

 

 

6.2. The attribution of the economic activity code

When several codes are possible for the activity sector to be attributed to an enterprise in the ASIA business register, a probabilistic choice procedure is applied; this procedure is based on quality indicators of the individual administrative sources. The basic hypothesis is that all sources contain errors and that the 'optimal' attribute is the one corresponding to the minimum error probability.

This approach has the advantage of putting all registers on the same level; the quality of a single archive, with respect to the variable considered, can be quantified by the mean percentage of errors found by comparison with all the other archives. The solution consists of calculating statistical quality indicators based on the concept of mean concordance in the comparisons of information; from these measures the probabilities for each source are obtained.

The same approach has been extended to other attributes, such as legal status.

The individual records of the administrative sources result from harmonisation and from sectoral conversion from other nomenclatures, so they may carry several possible codes for the activity sector. In this context a concordance rule has been adopted: two sources are considered concordant if at least one of the activity codes contained in the first source equals at least one of those contained in the second.

We define as 'optimal' the value with the minimum error probability; this probability, for each source, is defined as the ratio between the number of codes (of an attribute) that do not match the other sources and the total number of comparisons.
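The concordance rule and the resulting error probability of each source can be sketched in Python. This is one possible reading of the definitions above, assuming that every pair of sources available for a unit counts as one comparison for both sources; the activity codes in the example are arbitrary strings.

```python
from itertools import combinations

def concordant(codes_a, codes_b):
    """Two sources agree if they share at least one activity code."""
    return bool(set(codes_a) & set(codes_b))

def error_rates(units):
    """units: list of dicts mapping each available source to its set of codes.
    Returns, per source, mismatches divided by the number of comparisons."""
    mismatches, comparisons = {}, {}
    for unit in units:
        for a, b in combinations(sorted(unit), 2):
            agree = concordant(unit[a], unit[b])
            for source in (a, b):
                comparisons[source] = comparisons.get(source, 0) + 1
                mismatches[source] = mismatches.get(source, 0) + (0 if agree else 1)
    return {s: mismatches[s] / comparisons[s] for s in comparisons}

# Hypothetical units with illustrative codes.
units = [
    {"TR": {"52.11"}, "CCIAA": {"52.11", "52.12"}, "INPS": {"15.81"}},
    {"TR": {"45.21"}, "CCIAA": {"45.21"}},
]
rates = error_rates(units)
```

In this example INPS disagrees in both of its comparisons, while TR and CCIAA each disagree in one comparison out of three.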

The error probability to be minimised is therefore derived as an average of the error probabilities with respect to the individual archives compared; this 'ex-ante' evaluation is associated with each source. For each unit and for each code of the attribute (activity code) reported by the different sources, the 'compound' probability is computed as the product of ex-ante probabilities. With only two sources A and B, writing Pa and Pb for the ex-ante probabilities that the code of A and of B is correct (the complements to 1 of their error probabilities), the compound probability associated with a code is obtained from the following hypotheses:

a) The codes of the two sources coincide: Pa*Pb

b) The codes of the two sources do not coincide:

b1) hypothesis - the first is correct (the second is wrong): Pa*(1-Pb)

b2) hypothesis - the second is correct (the first is wrong): Pb*(1-Pa)

These hypotheses, and the related probabilities, are presented in the following table:

 

Hypothesis:
                                 Source A
                     Correct       False             Total
Source B   Correct   Pa*Pb         Pb*(1-Pa)         Pb
           False     Pa*(1-Pb)     (1-Pa)*(1-Pb)     (1-Pb)
           Total     Pa            (1-Pa)            1

The method can be extended to K sources.

The compound probability associated with the code recorded in source A, under the hypothesis that this code is correct, is therefore given by the product of the K source probabilities, each taken as it is or as its complement to 1 according to whether the code of that source is the same as or different from the one recorded in register A. In the case of a hierarchical classification, as for the activity sector, it is better to run the procedure for each level of the classification, so that a mismatch at the first level (first digit) is not treated in the same way as a mismatch at the last level (last digit).
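The extension to K sources can be sketched as follows in Python. The sketch assumes one code per source and uses invented ex-ante probabilities of correctness; it does not implement the level-by-level treatment of hierarchical codes mentioned above.

```python
def compound_probability(candidate, unit_codes, p_correct):
    """Product over sources of p (source agrees with the candidate code)
    or 1 - p (source disagrees), under the hypothesis that the candidate is right."""
    prob = 1.0
    for source, code in unit_codes.items():
        p = p_correct[source]
        prob *= p if code == candidate else (1.0 - p)
    return prob

def choose_code(unit_codes, p_correct):
    """Select the recorded code with the highest compound probability."""
    candidates = sorted(set(unit_codes.values()))
    return max(candidates, key=lambda c: compound_probability(c, unit_codes, p_correct))

# Hypothetical ex-ante probabilities and illustrative codes for one unit.
p_correct = {"TR": 0.8, "CCIAA": 0.9, "INPS": 0.7}
unit_codes = {"TR": "52.11", "CCIAA": "52.11", "INPS": "15.81"}
chosen = choose_code(unit_codes, p_correct)
```

For this unit the code shared by TR and CCIAA has compound probability 0.8 * 0.9 * (1 - 0.7), against (1 - 0.8) * (1 - 0.9) * 0.7 for the code reported by INPS alone, so the shared code is selected.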

In the first establishment of the ASIA business register the methodology gave the results summarised in the table below. The first column shows the pattern of availability (1) or non-availability (0) of the administrative sources, in the same order as the following five columns. The ISTAT column indicates how many records were attributed using information held in ISTAT databases (census, structural business surveys). The "Donor" column indicates how many records with a missing value were attributed by the so-called "donor" method, which selects a unit donating the missing value by minimising a distance function based on other correlated variables.

Tab. 8 - Distribution of enterprises by source availability and ATECO source chosen

                                  Source chosen
Availability   ENEL      CCIAA     INPS      INAIL     TAX         ISTAT     Donor     Total
00001          0         0         0         0         585,556     29,858    55,573    670,987
00011          0         0         0         27,394    23,058      2,633     841       53,923
00101          0         0         11,179    0         8,772       1,510     2,188     23,649
00111          0         0         16,561    12,267    29,168      5,133     76        63,205
01001          0         285,541   0         0         576,553     83,113    36,785    981,992
01011          0         77,689    0         235,016   233,725     124,555   135       671,120
01101          0         7,633     16,647    0         29,604      12,629    4         66,517
01111          0         31,860    82,238    62,625    168,973     84,765    1         430,462
10001          43,643    0         0         0         184,373     21,075    27,399    276,490
10011          4,302     0         0         5,301     13,126      2,808     180       25,717
10101          1,012     0         2,359     0         6,898       1,457     101       11,827
10111          4,885     0         7,817     3,428     38,656      9,936     127       64,849
11001          25,850    112,554   0         0         242,377     75,837    592       457,210
11011          38,968    55,317    0         93,518    191,065     96,438    13        475,319
11101          7,127     6,069     9,272     0         25,509      11,787    1         59,765
11111          44,016    39,841    89,002    78,167    238,444     151,078   2         640,550
Total          169,803   616,504   235,075   517,716   2,595,857   714,605   124,018   4,973,593

In practice, the economic activity code is attributed using the information available in the ISTAT source when it is present and is confirmed by at least one other source; otherwise the methodology described above is applied to the five administrative sources.
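The "donor" method counted in Tab. 8 can be illustrated with a minimal nearest-neighbour sketch in Python. The paper does not specify the distance function or the correlated variables, so the squared Euclidean distance and the field names `turnover` and `employees` used here are assumptions; in practice the variables would need to be put on comparable scales.

```python
def donor_value(recipient, donors, variables, target="activity_code"):
    """Return the target value of the donor closest to the recipient."""
    def distance(donor):
        # Assumed distance: squared Euclidean over the chosen variables.
        return sum((recipient[v] - donor[v]) ** 2 for v in variables)
    return min(donors, key=distance)[target]

# Hypothetical donors with a known code and a recipient with a missing code.
donors = [
    {"turnover": 100.0, "employees": 2, "activity_code": "A"},
    {"turnover": 900.0, "employees": 40, "activity_code": "B"},
]
recipient = {"turnover": 120.0, "employees": 3}
imputed = donor_value(recipient, donors, ["turnover", "employees"])
```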

 

6.3. The attribution of employment.

The administrative sources of information on employment are the Social Security Register (INPS), the Work Accidents Insurance Register (INAIL) and the Chambers of Commerce, Crafts and Industry Register (CCIAA). All three give (usually mismatching) information on salaried employment; the second and the third also give (limited) information on independent employment. All these sources are inputs to the ASIA register.

In order to carry out individual estimation, the underlying strategy is to exploit further information on the size of the enterprise that is available for the same unit through linkage with other registers, for instance the turnover of the enterprise, derivable from the VAT register, or electricity consumption from ENEL; these indicators constitute the instrumental variables.

Let N be the attribute (employment) to be estimated for a given unit j (enterprise), let V be an instrumental variable that provides information about N, and let S(i) be the i-th source; N is to be determined for each unit j in the ASIA register as the value Naj. The data can be represented as in the table below:

Unit   N     S(1)   ...   S(i)   ...   S(k)   V     ASIA
1      N1    N11    ...   Ni1    ...   Nk1    V1    Na1
2      N2    N12    ...   Ni2    ...   Nk2    V2    Na2
...    ...   ...    ...   ...    ...   ...    ...   ...
j      Nj    N1j    ...   Nij    ...   Nkj    Vj    Naj
...    ...   ...    ...   ...    ...   ...    ...   ...
N      NN    N1N    ...   NiN    ...   NkN    VN    NaN

The column N contains the 'true' employment figures Nj, usually unknown; for each unit j the goal of ASIA is to choose Naj among the available values Nij, or to estimate it if no data are available, so as to come as close as possible to Nj.

A probabilistic approach, in its Bayesian version, models the information about N by the probabilities P(N) (ex-ante information) and P(N|V) (ex-post information), where knowledge of V, which is expected to improve the information on N, determines the transition from P(N) to P(N|V). The decision procedures are based on the final distribution P(N|V), understood as the state of information reached from P(N) once the value of V is known. The probability distributions are based on empirical frequencies.

We formalise the two situations described, lack of information and mismatching data for individual units, as the "problem of estimation" and the "problem of choice" in a microdata context, although they can be considered two specifications (in terms of admissible decisions) of the same conceptual model. We speak of choice when a value has to be selected among a limited number of alternatives, namely the figures recorded by the registers; this set of alternatives often differs from enterprise to enterprise. We speak of estimation when, for a single unit, there is no indication of the plausible value of the attribute and the choice is made over a range of admissible values, usually the same set for all units in a stratum.

Fortunately neither the mismatching of the registers nor the lack of coverage is complete, so P(N|V) can be estimated within a sectoral stratum from the available data. The underlying model N=f(V) is evaluated on the available data for the problem of estimation and on the matching figures for the problem of choice; these procedures are stratified by activity sector and some other structural variables. Data coming from different sources are considered coherent when they all fall within a given range.

The probability distributions P(N|V) are computed by means of Bayes' theorem: P(N|V) = P(N)P(V|N)/const, where P(V|N) is the likelihood function, P(N) is the prior distribution for N, derived from the available data, and const is the normalising constant.

The estimated final probability distribution P(N|V), based on computed frequencies (empirical distribution over classes of N and V), is at this stage non-parametric.

In the problem of choice, S(1), ..., S(i), ..., S(k) are the k sources that, for an enterprise j, provide the data N1j, ..., Nij, ..., Nkj; the value V=Vj is observed and we want to find the true value Nj (assumed to be one of the Nij). To choose, for each enterprise for which information is consistent, we observe the value of V and compare the probabilities Pr(Nij=Nj | V=Vj) for the different hypotheses about N.

Estimating and comparing the probabilities over the possible values of N without considering which register i the data come from is a simplification, obtained under the hypothesis that the quality Qij of the various registers is the same (for each register i and each enterprise j) and does not depend on N; otherwise a quality function has to be determined in order to weight the probabilities of the values of N before comparing them. The choice among the alternatives is therefore made by comparing the k probabilities P(Ni|V) and choosing the most likely figure; recall that the function P(N|V) is estimated on the data where the registers do not diverge.

For the problem of choice the prior distribution is approximated by the one observed on the matching data. After the first experiments we found that the registers are of different quality, and therefore introduced a system of weights to reflect these differences: naturally this tends to steer the choices towards the most reliable register.
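The problem of choice can be sketched in Python as follows: the empirical distribution of employment classes given the class of the instrumental variable is estimated on units where the registers agree, and the candidate with the highest (optionally weighted) probability is selected. The class labels, the weights and the function names are assumptions made for illustration; stratification by activity sector is omitted.

```python
from collections import Counter, defaultdict

def fit_conditional(matching_pairs):
    """Empirical P(N class | V class) from (n_class, v_class) pairs
    observed on units where the registers do not diverge."""
    counts = defaultdict(Counter)
    for n_cls, v_cls in matching_pairs:
        counts[v_cls][n_cls] += 1
    return {v: {n: c / sum(cnt.values()) for n, c in cnt.items()}
            for v, cnt in counts.items()}

def choose_source(candidates, v_cls, cond, weights=None):
    """candidates: source -> employment class recorded by that source.
    Returns the (source, class) pair with the highest weighted probability."""
    weights = weights or {}
    def score(item):
        source, n_cls = item
        return weights.get(source, 1.0) * cond.get(v_cls, {}).get(n_cls, 0.0)
    return max(sorted(candidates.items()), key=score)

# Hypothetical matching data: (employment class, turnover class).
pairs = [("1-9", "low")] * 8 + [("10-49", "low")] * 2 + [("10-49", "high")] * 5
cond = fit_conditional(pairs)
choice = choose_source({"INPS": "1-9", "INAIL": "10-49"}, "low", cond)
```

With equal weights the more probable class for a low-turnover unit wins; passing a `weights` dictionary shifts the choice towards the register judged more reliable.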

The results of the application of these procedures on the ASIA business register are summarised in the following table; it shows the distribution of single choices between the registers by data availability situation (first column).

Tab. 9 - ASIA salaried employment figures, register choices by data availability situation

                              Choices
Availability     INPS        INAIL     CCIAA    Total
3 SOURCES        375,281     2,952     15,537   393,770
INPS / INAIL     229,157     24,037             253,194
INPS / CCIAA     181,175               7,036    188,211
INAIL / CCIAA                4,678     22,365   27,043
INPS             286,104                        286,104
INAIL                        113,044            113,044
CCIAA                                  26,961   26,961
TOTAL            1,071,717   144,711   71,899   1,288,327

The goal of the estimation problem is to define the final distribution (over the values of N) of the missing data, Pm(N|V), in an admissible region for N, say a range (N1, N2). Once the functions P(N|V) have been estimated for all V, for a single enterprise j, given V, the value Nj* can be selected as the most probable value of N, or it can be drawn at random from that distribution. The distributions are estimated from the available individual data; if more than one figure is available (two sources are used for independent employment) the average is taken as the individual value. In the application (ASIA) the initially available figures cover 50% of the total, and these provide the information for the estimation process over the remaining units.

A procedure for approximating the prior distribution Pm(N) (also based on the instrumental variables) has been developed for the problem of estimation, since the missing data may not have the same distribution as the available data; it draws on the distribution of the instrumental variables over the units for which N is missing.
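The two ways of attributing a value in the problem of estimation, taking the most probable value or drawing it at random from the estimated distribution, can be sketched in Python. The distribution below is invented; the approximation of the prior Pm(N) for missing data is not reproduced here.

```python
import random

def estimate(dist, rng=None):
    """dist: employment class -> probability given the observed V.
    Returns the modal class, or a random draw when a generator is supplied."""
    classes = sorted(dist)
    if rng is None:
        return max(classes, key=dist.get)
    return rng.choices(classes, weights=[dist[c] for c in classes], k=1)[0]

# Hypothetical P(N | V = v) for one enterprise.
dist = {"0": 0.1, "1": 0.7, "2": 0.2}
modal_value = estimate(dist)
random_value = estimate(dist, random.Random(0))
```

Random attribution preserves the shape of the distribution over many units, whereas the modal value concentrates all attributions on the most frequent class; which of the two is preferable depends on the intended use.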

The results of the application of these procedures to the ASIA business register are summarised in the following table; it shows the distribution of individual attributions by data availability.

Tab. 10 - ASIA independent employment figures: register choices by data availability situation

Available Sources    Choices
Only INAIL           633,902
Only CCIAA           1,105,423
INAIL and CCIAA      708,324
Estimation           2,525,944
Total                4,973,593

These procedures work with (many) small enterprises; large units are regularly surveyed by ISTAT.

 

6.4. The final results obtained

The imputation of the main stratification characters makes it possible to define the universe of the ASIA register by establishing whether each unit is active or not (with its dates of birth and cessation), which economic activity it carries out, whether it is an enterprise or an institution, and its location and size. In this way an aggregate of 3,730,158 enterprises active at 31.12.95 is determined from the almost 5 million clusters processed with the methodologies described above.

Tab. 11 - The definition of the ASIA active enterprises

Cluster typology            N of clusters
Active enterprises          3,730,158
Not active enterprises      1,101,623
Institutions                80,469
Agricultural enterprises    61,337
Total clusters              4,973,587

Further quality checks are defined on this aggregate of enterprises, and a field control is carried out through statistical surveys.

 

7. The ASIA information quality.

7.1. The check plan of the information quality.

The last phase of work before the final release of the ASIA register concerns the check plan of information's compatibility. The latter is necessary because the complex probabilistic methodologies, which allowed to locate the variables according to the information from the input archives, have been applied to all units and also when the attribution of each variables has been based on scarce primary information.

All the attributes of each unit in the ASIA register give a further enrichment to the information, on which it's possible to redefine one or more characters starting from those obtained with a relative higher probability of error.

1st phase: comparison of ASIA register information with those of the Sirio-NAI register.

The Sirio-Nai archive of middle-large sized enterprises, realised by ISTAT on the base of information obtained by the economic census (1991), has information yearly updated through several surveys addressed to enterprises. Such archive covers the statistical requirements and constitutes an important source for verification of the main variables in the ASIA register.

2nd phase: correction of ASIA with the information SCAP.

The second phase foresees an improvement of some characters in the ASIA register through a linking of a tank (named SCAP) containing corrections of the error located on the base of the previous versions of the register. Such system is going to safeguard in the future positions already verified since the annual regeneration of ASIA could have the same errors already located and corrected.

3rd phase: check between some key-words contained in the enterprise name and the economic activity and legal status.

Since in some cases the enterprise name gives a clear indication of the activity carried out, some key-words have been identified (for instance "pharmacy", "restaurant", "butcher's shop", etc.), each associated with one or more corresponding economic activity codes. In some cases the classification error can be corrected automatically; in others a manual check is necessary.

4th phase: check of the coverage of some economic sectors through sectorial archives.

For some sectors of particular interest, information has been gathered from the corporations, administrations or associations that supervise the enterprises operating in the relevant sector. In this way such enterprises are checked in ASIA with higher precision. The main sectors having sectorial archives supporting the statistical register are the following: financial intermediation, insurance, municipal enterprises, large-scale retail trade, state-controlled enterprises.
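
A minimal sketch of such a key-word check is given below. The key-word table, the placeholder activity codes and the function name `check_name_against_activity` are hypothetical; the rule used here to decide between automatic and manual correction (automatic only when the key-word maps to a single code) is an assumption, not the documented ISTAT rule.

```python
# Hypothetical key-word table: each key-word maps to the activity codes
# considered compatible with it (the codes are placeholders, not real ones).
KEYWORD_CODES = {
    "pharmacy": {"A1"},
    "restaurant": {"B1", "B2"},
    "butcher": {"C1"},
}

def check_name_against_activity(name, activity_code, table=KEYWORD_CODES):
    """Return (status, proposed_code) for one unit.

    status is "ok", "auto" (automatic correction) or "manual".
    """
    lowered = name.lower()
    for keyword, codes in table.items():
        if keyword in lowered:
            if activity_code in codes:
                return ("ok", activity_code)
            if len(codes) == 1:
                # Only one compatible code: the correction is unambiguous.
                return ("auto", next(iter(codes)))
            # Several candidate codes: leave the decision to a manual check.
            return ("manual", None)
    # No key-word found: the name gives no evidence either way.
    return ("ok", activity_code)
```

For example, a unit named "Pharmacy Rossi" registered under an unrelated code would receive an automatic proposal, while "Restaurant Da Mario" under an unrelated code would be routed to manual check because two codes are compatible with the key-word.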

5th phase: check of coverage.

The check of coverage is carried out by comparing the information contained in ASIA with that obtained by the 1991 Industrial and Services Census (CIS). The CIS used almost 100,000 enumerators covering the whole Italian road network; it was therefore carried out with a different method and is independent of the ASIA register. The coverage check takes into account the time lag between the two "surveys". For both surveys the same typologies of tables have been produced, pointing out the cells with high overcoverage or undercoverage. The tables on which the comparison is carried out are the following: i) enterprises and employees by province, economic activity (five digits) and eleven classes of employment; ii) enterprises and employees by province and municipality, economic activity (two digits) and two classes of employment. For each typology of table and for each combination of the variables' modalities, the percentage difference between ASIA and CIS data is calculated. For each economic activity code and class of employees, the median and the interquartile difference are calculated, treating each provincial (municipal) figure as an individual observation. All units contained in the cells where the following condition does not hold are therefore submitted to verification:

$$Me_{jk} - n\,(Q3_{jk} - Q1_{jk}) < p_{ijk} < Me_{jk} + n\,(Q3_{jk} - Q1_{jk})$$

where:

p = percentage difference
Me = median
Q3 - Q1 = interquartile difference
n = an integer
i = province and/or municipality
j = economic activity code
k = class of employees
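
The acceptance rule above can be sketched in a few lines. This is an illustration only: the function name `cells_to_verify`, the input layout and the default value n = 2 are assumptions, and the quartiles are computed with the default method of Python's `statistics.quantiles`, which may differ from the one used by ISTAT.

```python
from statistics import median, quantiles

def cells_to_verify(pct_diff, n=2):
    """pct_diff maps (i, j, k) -> percentage difference between ASIA and CIS.

    Returns the set of cells (i, j, k) lying outside the interval
    Me_jk -/+ n * (Q3_jk - Q1_jk).
    """
    # Group the territorial observations by (activity code j, class k).
    groups = {}
    for (i, j, k), p in pct_diff.items():
        groups.setdefault((j, k), []).append(p)

    flagged = set()
    for (i, j, k), p in pct_diff.items():
        values = groups[(j, k)]
        if len(values) < 2:
            continue  # quartiles cannot be estimated from one observation
        me = median(values)
        q1, _, q3 = quantiles(values, n=4)  # quartile method is an assumption
        iqr = q3 - q1
        # Strict inequalities as in the text: if the interquartile difference
        # is zero, every cell of the group is flagged.
        if not (me - n * iqr < p < me + n * iqr):
            flagged.add((i, j, k))
    return flagged
```

Every unit belonging to a flagged cell would then be submitted to verification.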

6th phase: rules of compatibility.

In this phase all enterprises, including those modified in the previous phases, are verified through compatibility rules (130 rules) in order to detect possible inconsistencies among the main variables of the same unit or in the relations between different records (enterprise - local units). The GRANADA software (Gestione Regole per l'ANAlisi dei Dati), developed on purpose at ISTAT, is used to manage the data check: it optimises the definition of the rules, the location of anomalous records and the management of deterministic or stochastic corrections. The compatibility rules have been defined by comparing the values observed for the main variables. The rules fall into two broad categories: errors and checks. The first category covers rules where data correction is necessary, while the second covers data considered anomalous, which may be either confirmed or corrected. The rules are further divided into three typologies according to the foreseen mode of correction: deterministic imputation, stochastic imputation and manual correction. The first two typologies are reserved for the rules that identify errors, while the third applies both to errors reviewed manually and to all the checks, where verification of the data does not exclude their confirmation.

In the stochastic case the RIDA (Ricostruzione dell'Informazione con Donazione Automatica) imputation procedure is applied. This procedure is based on the principle of similarity of behaviour: the missing datum (or the correction of an erroneous datum) is taken from a "donor" unit with complete and correct information, chosen among the units with the same features as the unit under treatment. The variables that determine the similarity among units are identified through a test of independence (chi-square) between the variable with missing (erroneous) data and the other candidate variables contained in the register. A mixed distance function is then calculated, combining the normalised distance on each variable (with weights given by the test values), and the donor is the unit at minimum mixed distance from the unit with the missing (erroneous) information.
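
The donor search at minimum mixed distance can be sketched as follows. This is not the RIDA implementation: the function names, the variable names in the usage example and the normalisation (range scaling for numeric variables, 0/1 mismatch for categorical ones) are assumptions, and the weights, which in the text come from the chi-square test values, are simply supplied as inputs here.

```python
def mixed_distance(unit, donor, weights, ranges):
    """Weighted sum of normalised per-variable distances.

    weights: {variable: weight}; ranges: {numeric_variable: max - min}.
    Variables not listed in `ranges` are treated as categorical.
    """
    total = 0.0
    for var, w in weights.items():
        a, b = unit[var], donor[var]
        if var in ranges:
            # Numeric variable: absolute difference scaled by its range.
            d = abs(a - b) / ranges[var] if ranges[var] else 0.0
        else:
            # Categorical variable: 0 if equal, 1 otherwise.
            d = 0.0 if a == b else 1.0
        total += w * d
    return total

def impute_from_donor(unit, donors, target, weights, ranges):
    """Return a copy of `unit` with `target` taken from the nearest donor."""
    best = min(donors, key=lambda d: mixed_distance(unit, d, weights, ranges))
    corrected = dict(unit)
    corrected[target] = best[target]
    return corrected
```

A usage example with two hypothetical donors:

```python
donors = [
    {"legal_form": "SRL", "employees": 12, "activity": "A1"},
    {"legal_form": "SPA", "employees": 300, "activity": "B2"},
]
unit = {"legal_form": "SRL", "employees": 10, "activity": None}
result = impute_from_donor(unit, donors, "activity",
                           weights={"legal_form": 2.0, "employees": 1.0},
                           ranges={"employees": 290})
```

Here the first donor is chosen because it shares the legal form and has a similar number of employees.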

About 5.8% of the enterprises in the ASIA register have been involved in the check phases described above. The following table shows the provisional results for each of the six check phases.

Tab. 12 - ASIA check plan results

Phases of the check plan                                                   N of enterprises involved
Comparison of ASIA reg. information with that of Sirio-NAI                                    41,456
Correction of ASIA with the SCAP information                                                  26,698
Check between key-words in the enterprise name and the econ. activity                         17,743
Check of the coverage of some econ. sectors through sectorial archives                         5,274
Check of coverage with the 1991 CIS data                                                       6,680
Rules of compatibility                                                                       115,684
Total                                                                                        213,535

Each phase of the check procedure is tracked by flags, which allow a thorough quality analysis at the end of the ASIA production process.

7.2. The check on territory through the 1996 intermediate census

The statistical register of active enterprises (ASIA) is a census-like picture of the economic units operating on the territory, created by integrating administrative sources and updated every year.

The first implementation of the ASIA register needs completion and verification on territory in order to check all the information registered and to test the methodologies adopted.

In the previous censuses the traditional surveying technique was to locate physically the local units of the economic bodies on the territory. In the 1996 intermediate census the units are surveyed in several ways, according to the different external networks to which the enterprise is connected for administrative obligations (fiscal, social security, etc.) and for access to public services (electricity network).

The intermediate census has the task of checking the information on active economic units and also of carrying out a survey that supplies wide information on the economic structure of the country and on the relations among economic operators.

ISTAT decided to organise two different surveys, staggered in time. Through the first, short-form survey, the definition of the reference universe is completed. Through the second, long-form survey, information on the structure and organisation of the units is gathered. In order to reduce the statistical burden and the surveying costs, the units are checked through four different techniques:

  1. desk check for the units with reliable information on eligibility and on characters;
  2. postal questionnaire for the units with uncertain information and for those considered very important for the register (medium-large enterprises and enterprises with several local units);
  3. telephone interview for the units that do not respond to the postal questionnaire;
  4. interview by a surveyor on territory for the units that respond neither to the postal questionnaire nor to the telephone interview.

The short-form questionnaires carry pre-printed the characters obtained by ISTAT through the integration of the information from the administrative registers. Next to each pre-printed character there is an empty space where the surveyed unit can correct possible errors or report possible variations (and the date of such variations) to be recorded in the ASIA register. Compared to the traditional census, the short-form survey should give the following advantages:

  • greater coverage, thanks to the previous integration of administrative data;
  • a lower burden for the enterprises, thanks to the use of innovative techniques of survey and check;
  • greater timeliness, thanks to the telematic techniques of survey, check and data dissemination.

The short-form survey has involved about 532,000 enterprises; the final results will be released at the end of October (one year after the beginning of the survey). It is already possible to analyse the first data, relating to 364,279 recorded and checked forms, in order to attempt a first analysis of the quality of the ASIA register.

Tab. 13 shows the distribution of the concordances/discordances between the values of the main characters contained in ASIA and those collected by the short-form survey. The results shown are indicative and need some qualifications:

  1. some discordances are due to a "variation" of the characters in the last period (see Tab. 14);
  2. the high percentage of discordances on the address is caused by partial modifications (missing or wrong street number);
  3. for the variable "employees", since it is surveyed with reference to two different years (1995 and 1996), the concordances/discordances are defined on the basis of a range of admissibility.

Finally, Tab. 15 shows, for the economic activity, the distribution of the "wrong" discordances by wrong digit.

 

Tab. 13 - ASIA/intermediate Census: concordances and discordances per main character

                                    Concordance            Discordance
Characters                          N            %         N            %
Activity status at 31.12.95         340,808      93.6      23,471       6.4
Economic activity code              322,844      88.6      41,434       11.4
Enterprise name                     346,865      95.2      17,413       4.8
Address                             288,278      79.1      76,000       20.9
Juridical status                    350,453      96.2      13,825       3.8
Employment                          317,905      94.0      20,351       6.0

 

Tab. 14 - ASIA/intermediate Census: distribution per discordance typology

                                    Variation              Wrong
Characters                          N            %         N            %
Economic activity code              6,983        16.9      34,451       83.1
Enterprise name                     10,255       58.9      7,145        41.1
Address                             24,224       31.9      51,776       68.1
Juridical status                    7,016        50.7      6,809        49.3

 

 

 

Tab. 15 - ASIA/intermediate Census: wrong economic activity code, distribution per wrong digit

                    Discordances
Wrong digit         N            %
first digit         12,672       36.8
second digit        4,477        13.0
third digit         9,020        26.2
fourth digit        4,256        12.4
fifth digit         4,026        11.6
Total               34,451       100.0

 

References used

Par. 1-4

ABBATE C. and GAROFALO G. (1997), "Use of integrated administrative sources in order to improve the quality of business register statistics", EUROSTAT, proceedings of the international workshop "Use of administrative sources for statistical purposes", Luxembourg.

ABBATE C. and GAROFALO G. (1998), "Recent innovation in business register and sample survey on enterprises", IASS/IAOS Conference, Aguascalientes, Mexico.

ISTAT (1998), "L'impianto normativo, metodologico e organizzativo del Censimento Intermedio dell'Industria e dei Servizi", cap. 3.

Par. 5

ABBATE C., GAROFALO G. and RUNCI C. (1996), "Il processo di produzione del primo impianto di ASIA - Imprese", ISTAT, mimeo, Roma.

Par. 6

CARONE A. and VIVIANO C. (1998), "Imputation of statistical attributes in a business register using integrated administrative sources: methodologies adopted for the Italian statistical register", Joint UNECE/EUROSTAT seminar on business registers, Geneva.

Par. 7

ABBATE C. and GAROFALO G. (1998), "Recent innovation in business register and sample survey on enterprises", IASS/IAOS Conference, Aguascalientes, Mexico.

ISTAT (1998), "L'impianto normativo, metodologico e organizzativo del Censimento Intermedio dell'Industria e dei Servizi", cap. 4-5.

 

General references

ABBATE C. (1995), "Una metodologia per la definizione ottimale degli attributi", in "Verso un sistema statistico integrato delle imprese in Europa", Franco Angeli, Milano.

ABBATE C. (1996), "La completezza delle informazioni. L'imputazione da donatore con distanza mista minima - Il prodotto RIDA", Quaderni di ricerca, n. 4, ISTAT, Roma.

GAROFALO G. and REVELLI R. (1996), "Le relazioni dinamiche tra le unità: definizioni, tecniche di rilevazione e implicazioni micro economiche", SIS, Atti della XXVIII riunione scientifica, Rimini.

MARTINI M. and AIMETTI P. (1989), "Un archivio delle Imprese per l'analisi economica. Fonti, metodi risultati", Unioncamere, Milano.

MARTINI M. and BIFFIGNANDI S. (1995), "Verso un sistema statistico integrato delle imprese in Europa", Franco Angeli, Milano.

EUROSTAT (1996), "Recommendation manual business register".

EUROSTAT (1996), "Les répertoires statistiques d'entreprises: problèmes et possibilités", Actes de la 82e conférence des DGINS, Vienna.


Home page of 12th International Roundtable on Business Survey Frames - Home page of Statistics Finland




 
