GENITOURINARY - ORIGINAL ARTICLE
Year: 2011 | Volume: 48 | Issue: 4 | Page: 488-495
Interobserver reproducibility of Gleason grading of prostatic adenocarcinoma among general pathologists
RV Singh1, SR Agashe2, AV Gosavi2, KR Sulhyan2
1 Department of Pathology, Tarini Cancer Hospital, Alwar, Rajasthan, India
2 Department of Pathology, Govt. Medical College, Miraj, Maharashtra, India
Date of Web Publication: 25-Jan-2012
Correspondence Address:
R V Singh
Department of Pathology, Tarini Cancer Hospital, Alwar, Rajasthan, India
Source of Support: None, Conflict of Interest: None
Context: The Gleason grade is the most widely used grading system for prostatic carcinoma and is recommended by the World Health Organization. Good interobserver reproducibility of this grading system is essential, as it has important implications for patient management. Aim: To assess the interobserver reproducibility of Gleason grading of prostatic adenocarcinoma. Design: A total of 20 cases of prostatic adenocarcinoma were scored by Gleason grade by 21 general pathologists. The scores were then compared using the κ-coefficient and a consensus score. Results: For Gleason score groups (2-4, 5-6, 7 and 8-10), overall agreement with the consensus score was 68%. Exact agreement of Gleason scores with the consensus score was 43.3%, and agreement within ±1 of the consensus score was 92.3%. The κ-coefficient for the primary grade ranged from -0.32 to 0.92, with 60% of the readings in the fair to moderate agreement range; for the secondary grade, κ ranged from -0.30 to 0.62, with 78% of the readings in the slight to fair agreement range. Kappa for Gleason scores ranged from -0.13 to 0.55, with 80% of the readings in the slight to fair agreement range; for Gleason score groups, κ ranged from -0.11 to 0.82, with 68.5% of the readings in the fair to moderate agreement range. Conclusions: In our study, the interobserver reproducibility of Gleason scores among general pathologists was low, highlighting the need to improve reproducibility through continuing educational sessions and by obtaining a second opinion in cases where the grade could significantly influence management.
Keywords: Adenocarcinoma, Gleason score, interobserver reproducibility, prostate
How to cite this article:
Singh R V, Agashe S R, Gosavi A V, Sulhyan K R. Interobserver reproducibility of Gleason grading of prostatic adenocarcinoma among general pathologists. Indian J Cancer 2011;48:488-95
» Introduction
Carcinoma of the prostate is the most common form of cancer in men and the second leading cause of cancer-related deaths. The American Cancer Society estimated 218,890 new cases of adenocarcinoma of the prostate in 2007, well surpassing lung cancer as the most frequently diagnosed carcinoma in men. The subjectivity involved in histological grading is mainly responsible for limiting the utility of the various grading systems. Inconsistency in histological grading may invalidate its use in treatment decisions. Thus, the reproducibility of a histological grading system is as important as its prognostic value.
Gleason grading is the most widely used, and recommended, grading system for prostatic carcinoma in the world today. Even so, it is not a foolproof method and does not offer complete reproducibility.
To date, many authors have examined the interobserver and intraobserver reproducibility of Gleason grading on radical prostatectomy or prostate biopsy specimens. The aim of the current study was to assess the interobserver reproducibility of Gleason grading of prostatic adenocarcinoma among general pathologists.
» Materials and Methods
Twenty hematoxylin and eosin-stained glass slides of prostatic adenocarcinoma were randomly retrieved from the histopathology files on the basis of the original diagnosis. Of these 20 slides, 10 were from needle biopsies, eight from transurethral resection of prostate (TURP) specimens, and two from prostatectomy specimens. The slides were selected to roughly represent the spectrum of Gleason scores, and no effort was made to select particularly difficult cases. The slides were of uniform and adequate quality, and reproducibility with respect to variation in slide quality was not studied. All slides were coded to ensure that they could not be identified by the pathologists.
Twenty-one general pathologists from a teaching institute participated in the study. The pathologists were randomly assigned code numbers from P1 to P21 to maintain anonymity. Before the start of the study, a seminar was held on Gleason grading based on the current practice of the grading system as identified from recent publications. For instruction and/or review in how to use the Gleason grading system, a written description of the system along with colored photomicrographs of the different Gleason patterns accompanied the slides. A proforma for reporting Gleason grade was circulated, and each pathologist was asked to assign a Gleason score based on the primary and secondary patterns, together with any tertiary pattern of higher grade than the secondary pattern, if present.
Interobserver agreement was assessed by comparing the pathologists pairwise using the simple κ-coefficient, a measure of interobserver agreement introduced by Cohen. When the observed agreement exceeds the chance agreement, the κ-coefficient is positive, with its magnitude reflecting the strength of the agreement. The strength of κ agreement is as follows:
κ statistic - Strength of agreement
< 0.00 - Poor
0.00-0.20 - Slight
0.21-0.40 - Fair
0.41-0.60 - Moderate
0.61-0.80 - Substantial
0.81-1.00 - Almost perfect
Kappa was calculated for interobserver agreement for each of the pathologists compared pairwise for primary grade (1-5), secondary grade (1-5), Gleason score (2-10), and Gleason score groups (2-4, 5-6, 7 and 8-10). The groupings of the Gleason scores were chosen to reflect those employed in patient management and were the same as those used elsewhere. A biomedical statistician assisted with the calculation of the κ-coefficients.
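As an illustration of the statistic (not the study's actual computation, which was carried out with a biomedical statistician), a minimal sketch of the unweighted Cohen's κ for one pair of raters might look like the following; the two pathologists' readings shown are hypothetical:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Unweighted (simple) Cohen's kappa for two raters' categorical ratings."""
    n = len(ratings_a)
    # Observed agreement: fraction of slides given identical ratings.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: sum over categories of the product of the two
    # raters' marginal frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical primary-grade readings by two pathologists on 10 slides:
p1 = [3, 3, 4, 4, 5, 3, 4, 5, 3, 4]
p2 = [3, 4, 4, 4, 5, 3, 3, 5, 3, 5]
print(round(cohens_kappa(p1, p2), 2))  # 0.55, i.e., moderate agreement
```

Here the observed agreement is 0.70 and the chance agreement is 0.34, so κ = (0.70 - 0.34)/(1 - 0.34) ≈ 0.55, which falls in the moderate range of the scale above.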
To assess percentage interobserver agreement, a mathematical consensus score was calculated for each slide. For each slide, the median of the panel's readings was first calculated separately for the primary and secondary grades. These two medians were then summed to give the mathematical consensus score for the slide. The number of Gleason scores recorded by each pathologist that agreed with the consensus for each slide was expressed as a percentage of the total number of slides read, i.e., the exact agreement.
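The median-based consensus score and the exact-agreement percentage described above can be sketched as follows. A hypothetical five-reader panel is used for brevity; with the study's 21 readers the median of an odd number of integer grades is always itself a whole grade:

```python
import statistics

def consensus_score(primary_readings, secondary_readings):
    """Mathematical consensus Gleason score for one slide: the median of the
    panel's primary grades plus the median of its secondary grades."""
    return (statistics.median(primary_readings)
            + statistics.median(secondary_readings))

def exact_agreement_pct(observer_scores, consensus_scores):
    """Percentage of slides on which an observer's score equals the consensus."""
    hits = sum(s == c for s, c in zip(observer_scores, consensus_scores))
    return 100 * hits / len(observer_scores)

# Hypothetical readings for one slide from a five-pathologist panel:
primary = [3, 4, 4, 3, 4]    # primary Gleason grades
secondary = [4, 4, 3, 4, 4]  # secondary Gleason grades
print(consensus_score(primary, secondary))              # 4 + 4 = 8

# One observer's scores across four slides vs. the consensus scores:
print(exact_agreement_pct([7, 8, 6, 9], [7, 7, 6, 9]))  # 75.0
```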
» Results
The Gleason scores assigned by the 21 pathologists ranged from 4 to 10; no slide was assigned a score of 2 or 3. Gleason score 7 was assigned the maximum number of times (137/420; 32.6%) and Gleason score 4 the least (2/420; 0.5%).
[Table 1] shows interobserver agreement for Gleason score groups with the consensus score groups. For the Gleason score groups (2-4, 5-6, 7 and 8-10), the maximum number of readings fell in the 8-10 group (229/420; 54.5%) and the fewest in the 2-4 group (2/420; 0.5%). Using the score groups, the overall percentage agreement of the panel of pathologists with the consensus score groups was 68.0%, ranging from 42.9% to 85.7% for individual pathologists.
Table 1: Interobserver agreement for observers' readings with the consensus score groups (2-4, 5-6, 7, 8-10)
[Table 2] shows the percentage agreement of Gleason scores with the consensus scores, and agreement within ±1, ±2 and ±3. The distribution of the difference between each reading and the consensus score for each slide was calculated. For 43.3% of the readings there was exact agreement with the consensus Gleason score, for 92.3% there was agreement within ±1 of the consensus score, and for 99.1% agreement within ±2 of the consensus Gleason score. These results varied between individual consensus scores, the percentage being lower for consensus scores 6 (35.7%) and 8 (34.3%) and higher for consensus scores 7 (65.1%) and 9 (71.4%). No slide had a consensus score of 2, 3, 4 or 5. Undergrading was seen in 23% and overgrading in 33.7% of the readings.
[Table 3] and [Table 4] show the κ agreement for primary grade, secondary grade, Gleason scores and Gleason score groups. Kappa interobserver agreement for the primary grade, when each of the pathologists was compared with each other, ranged from -0.32 to 0.92, with the majority of readings (60%; 254/420) in the fair to moderate agreement range [Table 3]; 11% (46/420) showed poor agreement and 1% (4/420) almost perfect agreement.
Table 3: Interobserver κ agreement for primary grade (1-5) and secondary grade (1-5), 21 pathologists, 20 cases
Table 4: Interobserver κ agreement for Gleason scores (2-10) and Gleason score groups (2-4, 5-6, 7, and 8-10), 21 pathologists, 20 cases
κ interobserver agreement for the secondary grade ranged from -0.34 to 0.62, with the majority of readings (78%; 328/420) in the slight to fair agreement range [Table 3]; 14% (58/420) showed poor agreement, and two readings out of 420 (0.5%) showed substantial agreement.
κ interobserver agreement for Gleason scores [Table 4] ranged from -0.13 to 0.55. In 4% of readings (18/420), κ was less than 0 (poor agreement), and in 80% (336/420) slight to fair agreement was seen. No reading showed substantial or almost perfect agreement.
For the Gleason score groups, κ ranged from -0.11 to 0.82 [Table 4]. Kappa for 10 of the 420 readings (2%) was less than 0 (poor agreement), and in 68.5% (288/420) there was fair to moderate agreement. Only two readings (0.5%) showed almost perfect agreement.
As seen above, κ agreement for Gleason score groups was marginally better than for Gleason scores, and agreement for the primary grade was better than for the secondary grade. The majority of the readings were in the slight to fair agreement range. Almost perfect agreement was achieved only for primary grades (1%) and Gleason score groups (0.5%).
» Discussion
To be clinically useful, a histopathological grading system must provide significant prognostic information, be reasonably easy to use, and be reproducible. Grading prostate cancer is particularly difficult because of the pronounced morphological heterogeneity of this tumor. Inevitably, any grading system is flawed by some degree of interobserver and intraobserver variability. The interobserver and intraobserver reproducibility of the Gleason grading system has been studied by several groups. Comparison of these studies is hampered by a number of factors, including:
- variable numbers of participants;
- in some instances, standardized review of Gleason grading before participation in the study;
- differing type of specimens;
- variable number of slides evaluated;
- different criteria for the "true" Gleason scores; and
- different groupings of scores.
One of the problems in evaluating interobserver agreement in surgical pathology is establishing the "true" diagnosis. A number of methodologies can be used, including calculating the percent exact agreement between pairs of observers, between observers and an expert diagnosis, and between observers and a consensus diagnosis. We compared the readings of the pathologists against a mathematical consensus score and against each other using the simple κ-coefficient. The κ statistic is important for reproducibility studies because it corrects for the chance agreement that is not accounted for when concordance alone is evaluated. The κ-value generally increases when the number of categories decreases, as does concordance. The κ-value also provides insight into the disparity of nonconcordant cases.
[Table 5] compares previous studies of the interobserver reproducibility of Gleason scores (2-10) with the present study. From this table it is evident that the reproducibility of Gleason scores for exact agreement in previous studies ranges from 9.9% to 70.8%. [Table 6] compares previous studies of the interobserver reproducibility of Gleason score groups (2-4, 5-6, 7 and 8-10) with the present study.
Table 5: Interobserver reproducibility of Gleason scores (2-10) for exact agreement and κ agreement: Previous studies and present study
Table 6: Interobserver reproducibility of Gleason score groups (2-4, 5-6, 7, and 8-10) for exact agreement and κ agreement: Previous studies and present study
Comparison of the results between studies of observer variation is affected by the number and experience of the participants, the number and selection of slides, variation in the criteria for Gleason grading, whether the criteria were agreed before or during the course of the study, and the methods of analysis, in particular the choice of groupings for the Gleason score.
In the present study, a Gleason score of 2-4 was assigned in only 0.5% of the readings. This shift away from reporting Gleason scores 2-4 has been identified in other studies as well and seems to be related to changes in interpretation over time. Gleason score 7 was identified as an area of difficulty both in this study and elsewhere. Fourteen of the 63 readings (22%) of slides with a consensus score of 7 [Table 2] were underscored in the present study, more than in the study of Melia et al., in which underscoring of consensus score 7 was seen in 13%. These differences centered on the assessment of small areas of fusion and the distinction between separate and fused small irregular glands arranged in a compact form. In addition, at times it may be difficult to determine whether the loss of acinar spaces is caused by compression artifact or by a real inability to form spaces. This difficulty has the potential to lead to inappropriate investigation and suboptimal patient management. Agreement is needed on a uniformly applicable definition of small irregular areas of gland fusion (not conforming to the Gleason cribriform pattern 3). Clarification is needed on the morphology of fusion (possibly involving identification of a common wall), the minimum number of glands involved, and the number of small areas of gland fusion required for assigning Gleason pattern 4.
With regard to patterns 4 and 5, the limits and proportions of tiny, poorly defined acinar structures versus cords and nests of cells appeared to be a problem in the present study, as elsewhere. In the present study, sheets of cells with many lumen formations were erroneously considered grade 5 in many readings instead of grade 4.
A tertiary grade was reported infrequently, and there was poor agreement on the presence and the individual grades of the tertiary pattern. Although its prognostic value has been reported, it could not be reliably used in practice, as identified in another study.
It has been observed that general pathologists more frequently underscore than overscore. However, in the present study, underscoring was seen in 23% and overscoring in 33.7% of readings [Table 2]. This could be because the present study included prostatectomy and TURP specimens in addition to needle biopsy specimens, and earlier studies showed that undergrading is seen on needle biopsy specimens. Moreover, grading 2-4 is discouraged on needle biopsy, as a higher grade is often found on the subsequent prostatectomy specimen. Allsbrook et al. are of the opinion that low-grade tumors must be more clearly defined, as even experienced urological pathologists do not show good interobserver agreement on them. Our study involved general pathologists, and it has been reported earlier that reproducibility among genitourinary pathologists is better than among general pathologists.
In our study, poor agreement (κ < 0) for the Gleason score was seen in 4% of the readings [Table 4]. This can have a major impact on treatment, as two pathologists may grade the same case differently without any agreement between them. In our setup we see 4-5 cases of adenocarcinoma of the prostate in a year, and the experience of the pathologists and how they learned Gleason grading do play a role in the reproducibility of the Gleason grading system, as identified in previous studies.
One of the major problems faced in the present study was evaluating the percentages of the different patterns present in the slides. In some cases, two distinct patterns were seen in approximately equal proportions, complicating the choice of a primary Gleason grade. In some cases, differences of opinion on the Gleason grade for the entire slide could be explained by the presence of approximately equal proportions of patterns, as seen in previous studies also.
Specific problem areas in the present study, similar to those in previous studies, were:
- recognition of the border areas between Gleason patterns, particularly recognition of the border areas between pattern 3 and pattern 4;
- recognition of invasive growth between pre-existing benign duct/acinar structures as pattern 3;
- interpretation of cribriform carcinoma as pattern 3 or 4 depending on the margin of the tumor.
It has been documented that significant improvements in Gleason grading have been accomplished when Gleason grading tutorials, including Web-based tutorials, are made available. 
A final comment should be made regarding improvement of the interobserver reproducibility of Gleason grading. Subjectivity will always be present in any grading system. The study by Mikami et al. indicates that good agreement for Gleason grading can be achieved by understanding the definition of each pattern in the scheme, as well as the pitfalls. In addition, although a lecture component could strengthen the understanding of the attendees and was expected to be the superior educational method, printed material using a case-oriented approach played a comparable role, and both seemed superior to self-learning from a standard textbook with a limited number of photomicrographs. Disagreement in grading can be attributed to various factors, including the heterogeneity of a given tumor consisting of various patterns and the existence of morphologically borderline tumors. Carlson et al. demonstrated that a standardized protocol can minimize observer variability, and Egevad et al. showed that a set of reference images may significantly improve the reproducibility of grading. In the present study, a lecture on Gleason grading was given before the commencement of the study, and written material on the reporting of Gleason grading based on current practices, along with photomicrographs of the different Gleason patterns, was distributed to the participating pathologists. In spite of this, the level of agreement achieved was not satisfactory. We therefore conclude that the experience of the pathologists with Gleason grading plays a significant role, as reported elsewhere.
The International Society of Urological Pathology (ISUP) convened a conference in 2005, which led to the consensus development of the "2005 ISUP Modified Gleason System." It recommended that the initial grading of prostate carcinoma be performed at low magnification; patterns 1-5 were clearly described, and differences in the interpretation of biopsy and prostatectomy specimens were indicated. Overall, the recommendations follow a trend towards the use of higher grades than before. Further studies are needed to determine whether the 2005 ISUP Modified Gleason System increases interobserver reproducibility among general pathologists and what its impact is on patient outcomes over time.
De la Taille et al. presented a novel approach to evaluating Gleason grading among pathologists using high-density tissue microarrays (TMA). They concluded that a Gleason score can easily be assigned to each TMA spot of a 0.6 mm-diameter prostate cancer sample, and their data indicated that TMA spot images may be a good approach for teaching the Gleason grading system because of the small areas of tissue involved.
To obtain optimal, although never perfect, results from our educational efforts, these varying opinions must, whenever possible, be reconciled, and a greater consensus must be developed. First, consensus itself will have to be defined. Some of the issues (e.g., how the actual grades should be reported) are more amenable to consensus. Resolution of other issues (e.g., whether poorly defined/incomplete glands represent pattern 4) will ultimately require comparison of large series of cases, possibly including a review of the slides from those series, which is in itself a daunting task. Another approach to improving the reproducibility of the Gleason grading system is to obtain a second opinion in cases where the grade could significantly influence management; this has been shown to be effective for the grading of prostatic cancer. All of these possible aids to improved accuracy will have important resource and management implications for patients.
» Acknowledgment
We are thankful to Dr. Rakhi Jagdale and Mrs. G. S. Garad for their help.
» References
1. Ozdamar SO, Sarikaya S, Yildiz L, Atilla MK, Kandemir B, Yildiz S. Intraobserver and interobserver reproducibility of WHO and Gleason histologic grading systems in prostatic adenocarcinomas. Int Urol Nephrol 1996;28:73-7.
2. Jemal A, Siegel R, Ward E, Murray T, Xu J, Thun MJ. Cancer statistics, 2007. CA Cancer J Clin 2007;57:43-66.
3. De la Taille A, Viellefond A, Berger N, Boucher E, De Fromont M, Fondimare A, et al. Evaluation of the interobserver reproducibility of Gleason grading of prostatic adenocarcinoma using tissue microarrays. Hum Pathol 2003;34:444-9.
4. Melia J, Moseley R, Ball RY, Griffiths DF, Grigor K, Harnden P, et al. A UK-based investigation of inter- and intra-observer reproducibility of Gleason grading of prostatic biopsies. Histopathology 2006;48:644-54.
5. Bain GO, Koch M, Hanson J. Feasibility of grading prostatic carcinoma. Arch Pathol Lab Med 1982;106:265-7.
6. Allsbrook WC Jr, Mangold KA, Johnson MH, Lane RB, Lane CG, Epstein JI. Interobserver reproducibility of Gleason grading of prostatic carcinoma: General pathologist. Hum Pathol 2001;32:81-8.
7. Allsbrook WC Jr, Mangold KA, Johnson MH, Lane RB, Lane CG, Amin MB. Interobserver reproducibility of Gleason grading of prostatic carcinoma: Urologic pathologists. Hum Pathol 2001;32:74-80.
8. Griffiths DF, Melia J, McWilliam LJ, Ball RY, Grigor K, Harnden P, et al. A study of Gleason score interpretation in different groups of UK pathologists; Techniques for improving reproducibility. Histopathology 2006;48:655-62.
9. Mikami Y, Manabe T, Epstein JI, Shiraishi T, Furusato M, Tsuzuki T, et al. Accuracy of Gleason grading by practicing pathologists and the impact of education on improving agreement. Hum Pathol 2003;34:658-65.
10. Epstein JI, Algaba F, Allsbrook WC Jr, Bastacky S, Boccon-Gibod L, De Marzo AM, et al. Acinar adenocarcinoma. In: Eble JN, Sauter G, Epstein JI, Sesterhenn IA, editors. World Health Organization Classification of Tumours. Pathology and Genetics of Tumours of the Urinary System and Male Genital Organs. Lyon, France: IARC Press; 2004. p. 179-84.
11. Epstein JI, Allsbrook WC Jr, Amin MB, Egevad LL; ISUP Grading Committee. The 2005 International Society of Urological Pathology (ISUP) Consensus Conference on Gleason Grading of Prostatic Carcinoma. Am J Surg Pathol 2005;29:1228-42.
12. Epstein JI, Allsbrook WC Jr, Amin MB, Egevad LL. Update on the Gleason grading system for prostate cancer: Results of an international consensus conference of urologic pathologists. Adv Anat Pathol 2006;13:57-9.
13. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas 1960;20:37-46.
14. Fleiss JL. Statistical methods for rates and proportions. 2nd ed. New York: Wiley; 1981.
15. Epstein JI, Partin AW, Sauvageot J, Walsh PC. Prediction of progression following radical prostatectomy. A multivariate analysis of 721 men with long-term follow-up. Am J Surg Pathol 1996;20:286-92.
16. Smith EB, Frierson HF Jr, Mills SE, Boyd JC, Theodorescu D. Gleason scores of prostate biopsy and radical prostatectomy specimens over the past 10 years: Is there evidence for systematic upgrading? Cancer 2002;94:2282-7.
17. Renshaw AA, Schultz D, Cote K, Loffredo M, Ziemba DE, D'Amico AV. Accurate Gleason grading of prostatic adenocarcinoma in prostate needle biopsies by general pathologists. Arch Pathol Lab Med 2003;127:1007-8.
18. Pan CC, Potter SR, Partin AW, Epstein JI. The prognostic significance of tertiary Gleason patterns of higher grade in radical prostatectomy specimens: A proposal to modify the Gleason grading system. Am J Surg Pathol 2000;24:563-9.
19. Egevad L. Reproducibility of Gleason grading of prostate cancer can be improved by the use of reference images. Urology 2001;57:291-5.
20. Epstein JI. Gleason score 2-4 adenocarcinoma of the prostate on needle biopsy: A diagnosis that should not be made. Am J Surg Pathol 2000;24:477-8.
21. di Loreto C, Fitzpatrick B, Underhill S, Kim DH, Dytch HE, Galera-Davidson H, et al. Correlation between visual clues, objective architectural features, and interobserver agreement in prostate cancer. Am J Clin Pathol 1991;96:70-5.
22. Oyama T, Allsbrook WC Jr, Kurokawa K, Matsuda H, Segawa A, Sano T, et al. A comparison of interobserver reproducibility of Gleason grading of prostatic carcinoma in Japan and the United States. Arch Pathol Lab Med 2005;129:1004-10.
23. Kronz JD, Silberman MA, Allsbrook WC Jr, Epstein JI, Bastacky SI, Burks RT, et al. Pathology residents' use of a web-based tutorial to improve Gleason grading of prostate carcinoma on needle biopsy. Hum Pathol 2000;31:1044-50.
24. Steinberg DM, Sauvageot J, Piantadosi S, Epstein JI. Correlation of prostate needle biopsy and radical prostatectomy Gleason grade in academic and community settings. Am J Surg Pathol 1997;21:566-76.
25. Carlson GD, Calvanese CB, Kahane H, Epstein JI. Accuracy of biopsy Gleason scores from a large uropathology laboratory: Use of a diagnostic protocol to minimize observer variability. Urology 1998;51:525-9.
26. Harada M, Mostofi FK, Corle DK, Byar DP, Trump BF. Preliminary studies of histologic prognosis in cancer of the prostate. Cancer Treat Rep 1977;61:223-5.
27. Svanholm H, Mygind H. Prostatic carcinoma reproducibility of histologic grading. Acta Pathol Microbiol Immunol Scand A 1985;93:67-71.
28. Rousselet MC, Saint-Andre JP, Six P, Soret JY. Reproducibility and prognostic value of Gleason's and Gaeta's histological grades in prostatic carcinoma. Ann Urol (Paris) 1986;20:317-22.
29. McLean M, Srigley J, Banerjee D, Warde P, Hao Y. Interobserver variation in prostate cancer Gleason scoring: Are there implications for the design of clinical trials and treatment strategies? Clin Oncol (R Coll Radiol) 1997;9:222-5.