
ISLAMIC MEDICAL EDUCATION RESOURCES-05

0803-Epidemiological Research Based On Large Data Analysis: Study Characteristics


Abstract

This paper documents a tendency to publish large studies based on previously or routinely collected data. Analysis of the studies published in one volume of each of 3 major epidemiological journals revealed median study sizes above 1000 for all types of study design, data collection, and study population. Cross sectional and follow up studies had the highest median study sizes when they were based on previously or routinely collected data. The paper discusses some of the problems associated with large studies.

-----------------------------------------------------------------------------------------------------------

 

Introduction

This study was motivated by the observation that recent epidemiological research is based on existing large databases or defined cohorts and that 100% sampling has become the usual practice. This is a reversal of the traditional epidemiological practice of selecting a small probability sample from a study population in order to reach conclusions about the target population [1].

 

Developments in information technology and mass access to the internet have opened up new fields of endeavor for the epidemiologist. For example, data can be collected from a large number of people using internet-based questionnaires [2].

 

The objective of the research was to survey epidemiological research published in 2006 in three high-impact journals and to describe the characteristics of these studies statistically. The three journals selected were the American Journal of Epidemiology 2006 Volume 169 Nos. 1-12, the International Journal of Epidemiology 2006 Volume 35 Nos. 1-4, and the European Journal of Epidemiology 2006 Volume 21 Nos. 1-9.

 

Methods

The study included only original papers that involved raw data. Reviews, meta-analyses, and analyses based on published data were excluded. A pre-tested data abstraction form was used to abstract the following essential information from each original research article: title, authors, issue and volume number, date of publication, type of study (cross sectional, case control, cohort, randomized community control, randomized clinical), study population (general population, defined population, ongoing study), type of data collection (routinely collected data, newly collected data, previously collected data, or a combination of these), and total number of study subjects (number recruited before any exclusions). Defined populations were hospitals, health insurance or health maintenance organizations, clinics, schools, factories, and ongoing studies. For case control studies the numbers of cases and controls were added together. For ongoing studies it was assumed that data collection was new unless special mention was made of using previously collected data. The data were keyed into an SPSS database for categorical analysis using the chi-square test statistic to test for association. The Kruskal-Wallis non-parametric test statistic was used to compare study size across study designs and types of data collection.
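As an illustration of the categorical analysis described above, the following is a minimal Python sketch of a Pearson chi-square test of association. It is not the SPSS procedure used in the study. The counts are the study design by journal frequencies from Table 1; the function names are hypothetical, and the P value helper relies on a closed form that holds only when the degrees of freedom are even.

```python
import math

def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, df

def chi_square_p_value_even_df(statistic, df):
    """Upper-tail chi-square probability. Assumption: df is positive and even,
    which allows the closed form exp(-x/2) * sum((x/2)^k / k!)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = statistic / 2.0
    series = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * series

# Study design by journal (columns: IJE, EJE, AJE), counts taken from Table 1.
study_design = [
    [16, 33, 13],  # cross sectional
    [9, 9, 7],     # case control
    [16, 13, 21],  # follow up
]

stat, df = chi_square_independence(study_design)
p_value = chi_square_p_value_even_df(stat, df)
print(f"X2 = {stat:.2f}, df = {df}, P = {p_value:.3f}")
```

Run on these counts, the sketch gives a statistic of about 10.10 on 4 degrees of freedom with P of about 0.039, which agrees with the study design row of Table 1.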

 

Results

A total of 137 studies were analyzed. Table 1 shows significant variation among journals in study design and method of data collection. There was, however, no significant variation among journals in the choice between defined and general study populations. Table 2 shows that the median study size was over 1000 for all journals and types of study design. Its variation among journals was significant for follow up studies but not for cross sectional or case control studies. The median study size did not vary significantly among the 3 journals for different study populations and methods of data collection. Table 3 shows that cross sectional and follow up studies had significantly higher median study sizes in general populations than in defined populations. No such significant variation was seen in case control studies. Table 4 shows that follow up studies had higher median study sizes if based on previously or routinely collected data. No such significant variation was observed for cross sectional and case control study designs. Table 5 shows significant variation of median study size with the type of data collection. Median study size in defined populations was highest for combined newly and routinely collected data, whereas in the general population it was highest for previously or routinely collected data.

 

Discussion

Median study sizes were highest for cross sectional and follow up studies and when based on previously or routinely collected data. This was due to the availability of large databases with routinely or previously collected information. The availability of large databases and high speed computers encouraged epidemiologists to analyze data without probability sampling. Several issues arise in discussing large studies [1]. A large data set can give very stable parameter estimates, but a comparable degree of precision could have been obtained from a smaller sample. What was lost was the ability of the epidemiologist to inspect a small manageable data set, internalize it, and let intuition act before the data are analyzed. The more intimate contact of the epidemiologist with the data traditionally accounted for a depth of understanding and discussion that is missing in the new trend. Easy availability of large databases also encouraged epidemiologists to plunge into data analysis before giving serious thought to the research questions. In some cases the research questions were prompted by preliminary analysis, which can lead to numerous biases. Use of large data sets had the advantage of external validity, which had never been the primary objective of epidemiological research. Epidemiologists traditionally aimed at carrying out a small study based on probability sampling so that they could easily identify and control confounding and other sources of bias, with the ultimate aim of internal validity. They knew that external validity (generalization) would be attained inductively by consideration of several studies that are internally valid. Use of large sets of routinely collected data also raised the issue of data quality, since such data are collected with service and administrative considerations in mind and not research considerations.
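The point about precision can be made concrete with a small worked example. The Python sketch below assumes a simple random sample, a binary outcome with the worst-case proportion of 0.5, a normal approximation with z = 1.96, and no finite population correction. These assumptions and the function names are illustrative and do not come from the surveyed studies.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion estimated from n subjects."""
    return z * math.sqrt(p * (1 - p) / n)

def required_sample_size(margin, p=0.5, z=1.96):
    """Smallest n whose approximate margin of error does not exceed `margin`."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

# Precision improves only with the square root of the sample size.
for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"n = {n:>9,}: margin of error = {margin_of_error(n):.4f}")

print("subjects needed for a margin of 0.03:", required_sample_size(0.03))
```

Under these assumptions about a thousand subjects already give a margin of error near three percentage points, and every tenfold increase in study size narrows it by only a factor of about 3.2. The sketch addresses random error only; it says nothing about confounding or data quality.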

           


TABLE 1: CLASSIFICATION OF ARTICLES BY JOURNAL

Study characteristics        IJE          EJE          AJE         Total   X2 (df)    P value (a)
                           n   (%)      n   (%)      n   (%)

Study design
  Cross sectional         16  (39.0)   33  (60.0)   13  (31.7)       62   10.10 (4)  0.039
  Case control             9  (22.0)    9  (16.4)    7  (17.1)       25
  Follow up               16  (39.0)   13  (23.6)   21  (51.2)       50

Study population
  Defined population      12  (29.3)   23  (41.8)   22  (53.7)       57   5.02 (2)   0.081
  General population      29  (70.7)   32  (58.2)   19  (46.3)       80

Data collection
  Newly & routinely        6  (14.6)    5  ( 9.1)    3  ( 7.3)       14              0.030 (b)
  Newly                   13  (31.7)   28  (50.9)   25  (61.0)       66
  Previously               0  ( 0.0)    1  ( 1.8)    3  ( 7.3)        4
  Routinely               22  (53.7)   21  (38.2)   10  (24.4)       53

 

a Chi-square test;  b Fisher’s exact test;

IJE = International Journal of Epidemiology

EJE = European Journal of Epidemiology

AJE = American Journal of Epidemiology

 


TABLE 2: NUMBER OF RESEARCH SUBJECTS BY STUDY CHARACTERISTICS AND JOURNAL

                                       Number of research subjects
Study characteristics Journal     n     Median     Min.           Max.   X2 (df)    P value (a)

Study design
  Cross sectional     IJE        16      7,183      107    212,467,094   0.63 (2)   0.730
                      EJE        33      4,599      112     36,000,000
                      AJE        13      2,255      139    212,467,094

  Case control        IJE         9      1,263      288        166,310   1.61 (2)   0.446
                      EJE         9      1,051      372      2,222,404
                      AJE         7      1,875      730    212,467,094

  Follow up           IJE        16     60,925    1,016     60,000,000   7.80 (2)   0.020
                      EJE        13      9,778       34     11,000,000
                      AJE        21      2,446      209    212,467,094

Study population
  Defined population  IJE        12      6,381      726        246,146   2.47 (2)   0.291
                      EJE        23      1,272       34      6,240,130
                      AJE        22      2,102      299      1,299,177

  General population  IJE        29     14,495      107    212,467,094   1.36 (2)   0.506
                      EJE        32      7,404      188     36,000,000
                      AJE        19     10,932      139    212,467,094

Data collection
  Newly & routinely   IJE         6      2,380      274        246,146   0.75 (2)   0.686
                      EJE         5     11,081    1,051      2,222,404
                      AJE         3     11,234    1,068    212,467,094

  Newly               IJE        13      3,290      288         14,495   3.90 (2)   0.142
                      EJE        28        937       34      6,240,130
                      AJE        25      2,010      139         21,610

  Previously          IJE         -          -        -              -   0.20 (1)   0.655
                      EJE         1     56,214   56,214         56,214
                      AJE         3      1,516      619    212,467,094

  Routinely           IJE        22    180,155      107    212,467,094   2.43 (2)   0.296
                      EJE        21     19,801      212     36,000,000
                      AJE        10    399,910    1,832    212,467,094

a Kruskal-Wallis Test

IJE = International Journal of Epidemiology

EJE = European Journal of Epidemiology

AJE = American Journal of Epidemiology


TABLE 3: NUMBER OF RESEARCH SUBJECTS BY STUDY DESIGN AND STUDY POPULATION

                                     Number of research subjects
Study design      Study pop.       n   Median    Min           Max   Z       P value (a)

Cross sectional   Defined pop.    22    2,222    112     6,240,130   -2.28   0.023
                  General pop.    40   10,832    107   212,467,094

Case control      Defined pop.    10    1,552    726     2,222,404   -1.00   0.318
                  General pop.    15    1,083    288   212,467,094

Follow up         Defined pop.    25    2,311     34     1,299,177   -3.77   <0.001
                  General pop.    25   83,875    188   212,467,094

a Mann-Whitney test


TABLE 4: NUMBER OF RESEARCH SUBJECTS BY STUDY DESIGN AND TYPE OF DATA COLLECTION

                                       Number of research subjects
Study design      Data collection    n   Median      Min           Max   X2 (df)    P value (a)

Cross sectional   New & routine      3    7,000      274        11,193   7.49 (3)   0.058
                  Newly             33    2,650      112     6,240,130
                  Previously         3    1,516      619   212,467,094
                  Routinely         23   95,000      107   212,467,094

Case control      New & routine      7    1,678    1,051   212,467,094   2.86 (2)   0.240
                  Newly             13      909      288         4,778
                  Previously         -        -        -             -
                  Routinely          5    1,272      828       166,310

Follow up         New & routine     40   15,127   11,081       246,146   29.53 (3)  <0.001
                  Newly             20      995       34        11,267
                  Previously         1   56,214   56,214        56,214
                  Routinely         25   87,922      212   212,467,094

a Kruskal-Wallis Test

 


TABLE 5: NUMBER OF RESEARCH SUBJECTS BY STUDY POPULATION AND METHOD OF DATA COLLECTION

                                         Number of research subjects
Study population    Data collection    n   Median      Min           Max   X2 (df)    P value (a)

Defined population  New & routine      7   11,234    7,000     2,222,404   12.10 (2)  0.002
                    Newly             39    2,010       34     6,240,130
                    Previously         -        -        -             -
                    Routinely         11    1,272      212     1,299,177

General population  New & routine      7    1,263      274   212,467,094   27.47 (3)  <0.001
                    Newly             27    1,653      139        47,859
                    Previously         4   28,865      619   212,467,094
                    Routinely         42  187,530      107   212,467,094

a Kruskal-Wallis Test

 

REFERENCES

1. Omar Hasan Kasule. The transition from sample to population epidemiology. Journal of the University of Malaya Medical Center (in press)

2. Alexandra Ekman, Paul W Dickman, Åsa Klint, Elisabete Weiderpass, and Jan-Eric Litton. Feasibility of using web-based questionnaires in large population-based epidemiological studies. European Journal of Epidemiology 2006; 21 (2): 103-111.

© Professor Omar Hasan Kasule, Sr. March 2008