Labeling text using Doccano
Doccano is an open-source text annotation tool. It can be used to create labeled datasets for:
- Text classification
- Entity extraction
- Sequence to sequence translation
Doccano can be used to create labeled data for training the EntityRecognizer model in arcgis.learn.
This software was created by Hiroki Nakayama, Takahiro Kubo, Junya Kamura, Yasufumi Taniguchi, and Xu Liang.
How to label training data for named entity recognition with doccano
- After Doccano has been deployed to the local machine, go to the Doccano homepage and log in with your credentials.
- Select the appropriate project type (for named entity recognition, this is Sequence Labeling).
- If data needs to be imported for annotation, go to 'Dataset' in the left panel, then click 'Actions' > 'Import dataset'.
- Select 'JSONL', click 'Select file(s)', and point it to the reports file (docanno_deployment\reports_label.jsonl). Alternatively, text documents can be uploaded using the 'Plain text' option.
- After the file has been imported, you will see the documents loaded on the screen.
- Click on 'Start annotation' from the top menu bar.
- Read through the document (use the bottom navigation bar to move between documents). Highlight a sequence of text with your mouse and select the relevant label for it.
- New labels can also be created by navigating to ‘Labels’ from the left panel.
- Once all the documents have been labeled, go to 'Dataset' > 'Actions' > 'Export dataset'.
- Select JSONL(Text-Labels).
- Set an export file name.
- Click Export.
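
The import and export steps above both work with JSONL files (one JSON object per line). The following is a minimal Python sketch of how such files could be prepared and inspected outside Doccano. It rests on assumptions that are not stated in this guide: that the import file carries each document under a "text" key, and that the JSONL(Text-Labels) export stores annotations under a "labels" key as [start, end, label] triples. The function names write_import_file and read_labeled_spans are hypothetical helpers, not part of Doccano or arcgis.learn. Check a line of your own export before relying on it.

```python
import json


def write_import_file(texts, path):
    """Write one JSON object per line, each holding a document under "text".

    Assumption: the Doccano JSONL import reads the document from a "text" key.
    """
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")


def read_labeled_spans(path):
    """Yield (document text, [(span text, label), ...]) for each exported line.

    Assumption: each exported line has a "text" key and a "labels" key
    holding [start_offset, end_offset, label_name] triples.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            text = record["text"]
            spans = [
                (text[start:end], label)
                for start, end, label in record.get("labels", [])
            ]
            yield text, spans
```

Printing the result of read_labeled_spans for a few documents is a quick way to confirm that the character offsets line up with the highlighted text before the export is used for training.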