Python Multilingual BibTeX Generator
This Python script generates multilingual_papers.bib by filtering the original anthology+abstracts.bib file for multilingual NLP research papers.
Features
- Identical Logic: Uses the same filtering logic as the JavaScript web application
- Comprehensive Detection: Detects multilingual papers using keywords and language names
- LaTeX Cleaning: Properly handles LaTeX commands and formatting
- Statistics: Provides detailed statistics about the filtering process
- Safe Operation: Checks for existing files and asks for confirmation before overwriting
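As a rough illustration of the LaTeX cleaning feature listed above, here is a minimal sketch. The function name clean_latex and the specific rules (accent commands, simple one-argument commands, escaped special characters, brace removal) are assumptions chosen for illustration; the actual script may handle more or different cases.

```python
import re
import unicodedata

# Map LaTeX accent commands to Unicode combining marks (illustrative subset).
_ACCENTS = {"'": "\u0301", "`": "\u0300", "^": "\u0302", '"': "\u0308", "~": "\u0303"}
_ACCENT_RE = re.compile(r"""\\(['`^"~])(?:\{([A-Za-z])\}|([A-Za-z]))""")


def clean_latex(text):
    """Hypothetical helper: turn a LaTeX-flavoured BibTeX value into plain text."""
    def accent(match):
        letter = match.group(2) or match.group(3)
        # Compose letter + combining mark into a single character where possible.
        return unicodedata.normalize("NFC", letter + _ACCENTS[match.group(1)])

    text = _ACCENT_RE.sub(accent, text)                         # \'e or \'{e} -> é
    text = re.sub(r"\\[A-Za-z]+\s*\{([^{}]*)\}", r"\1", text)   # \emph{x} -> x
    text = re.sub(r"\\([&%$#_])", r"\1", text)                  # \& -> &
    text = text.replace("{", "").replace("}", "")               # drop case-protecting braces
    return " ".join(text.split())                               # normalize whitespace
```

For example, clean_latex(r"A {M}ultilingual Study of Caf{\'e} \& \emph{Tea}") returns "A Multilingual Study of Café & Tea".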
Requirements
- Python 3.6 or higher
- No external dependencies (uses only standard library)
Usage
- Place your files: Ensure anthology+abstracts.bib is in the same directory as the script
- Run the script:
  python generate_multilingual_bib.py
- Follow prompts: The script will ask for confirmation if multilingual_papers.bib already exists
Output
The script will:
- Generate multilingual_papers.bib containing only multilingual papers
- Display statistics about the filtering process
- Show the top 10 most common keywords found
Example Output
Reading anthology+abstracts.bib...
Parsing BibTeX entries...
Found 50000 total papers
Found 2500 multilingual papers
Generating BibTeX content...
Writing to multilingual_papers.bib...
Successfully generated multilingual_papers.bib with 2500 papers!
Statistics:
Total papers processed: 50000
Multilingual papers found: 2500
Percentage multilingual: 5.0%
Top 10 keywords found:
multilingual: 1200 papers
chinese: 800 papers
crosslingual: 600 papers
hindi: 400 papers
low-resource: 350 papers
korean: 300 papers
arabic: 250 papers
japanese: 200 papers
spanish: 180 papers
french: 150 papers
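The statistics in the example output can be derived from per-paper keyword matches. The sketch below shows one way to compute the percentage and the top 10 keywords with collections.Counter; the function name summarize and its input shape (one list of matched keywords per multilingual paper) are assumptions, not the script's actual interface.

```python
from collections import Counter


def summarize(total_papers, matches):
    """Hypothetical helper.

    matches: one list of matched keywords per multilingual paper.
    Returns (percentage_multilingual, top_10_keywords).
    """
    counts = Counter()
    for keywords in matches:
        counts.update(set(keywords))  # count each keyword once per paper
    percentage = 100.0 * len(matches) / total_papers if total_papers else 0.0
    return percentage, counts.most_common(10)


percentage, top = summarize(4, [["multilingual", "hindi"], ["hindi"]])
print(f"Percentage multilingual: {percentage:.1f}%")
for keyword, n in top:
    print(f"{keyword}: {n} papers")
```

With the figures from the example output (2500 matches out of 50000 papers), the computed percentage is 5.0.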
Filtering Criteria
The script uses the same criteria as the web application:
Multilingual Keywords
- multilingual, crosslingual, multi-lingual, cross-lingual
- low-resource language, low resource language
- low-resource, low resource
Language Names
- 100+ language names including: Hindi, Chinese, Korean, Arabic, Spanish, French, German, Japanese, etc.
- Regional language variations and dialects
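To make the criteria concrete, here is a minimal sketch of how a paper's title and abstract might be checked against the two lists. The shortened lists, the function names find_keywords and is_multilingual, and the use of word-boundary matching are assumptions for illustration; the actual script may match differently (for example, by plain substring search).

```python
import re

# Shortened, illustrative lists; the real script defines longer ones.
MULTILINGUAL_KEYWORDS = [
    "multilingual", "crosslingual", "multi-lingual", "cross-lingual",
    "low-resource", "low resource",
]
LANGUAGE_NAMES = ["arabic", "chinese", "french", "hindi", "korean"]


def find_keywords(text):
    """Return the sorted keywords and language names that occur in text."""
    lowered = text.lower()
    found = set()
    for term in MULTILINGUAL_KEYWORDS + LANGUAGE_NAMES:
        # Word boundaries keep a term from matching inside a longer word.
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            found.add(term)
    return sorted(found)


def is_multilingual(title, abstract=""):
    """A paper counts as multilingual if any term matches title or abstract."""
    return bool(find_keywords(title + " " + abstract))
```

Under these assumptions, find_keywords("Cross-lingual transfer for Hindi and Korean") returns ['cross-lingual', 'hindi', 'korean'].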
Customization
You can modify the filtering criteria by editing the constants at the top of the script:
MULTILINGUAL_KEYWORDS = [
    'multilingual', 'crosslingual', 'multi lingual',
    # Add your custom keywords here
]
LANGUAGE_NAMES = [
    'afrikaans', 'albanian', 'amharic', 'arabic',
    # Add more language names here
]
Error Handling
The script includes robust error handling:
- Checks for input file existence
- Handles malformed BibTeX entries gracefully
- Provides clear error messages
- Asks for confirmation before overwriting existing files
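The file checks above could look roughly like the following sketch. The function names check_input and can_write and the prompt wording are hypothetical; only the "not found" message is taken from the Troubleshooting section of this README.

```python
import os


def check_input(path):
    """Hypothetical helper: report a missing input file instead of raising."""
    if not os.path.isfile(path):
        print(f"Error: {path} not found in current directory.")
        return False
    return True


def can_write(path, ask=input):
    """Hypothetical helper: True if path is free or the user confirms overwriting.

    ask is injectable so the prompt can be replaced in tests or batch runs.
    """
    if not os.path.exists(path):
        return True
    answer = ask(f"{path} already exists. Overwrite? (y/N): ")
    return answer.strip().lower() in ("y", "yes")
```

Defaulting to "no" on an empty answer keeps an accidental Enter keypress from destroying an existing output file.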
Performance
- Efficient regex-based parsing
- Memory-efficient processing for large files
- Fast keyword matching using set operations
Troubleshooting
File Not Found
Error: anthology+abstracts.bib not found in current directory.
Solution: Ensure the input file is in the same directory as the script.
No Papers Found
No multilingual papers found. Check your keywords and language lists.
Solution: Verify your BibTeX file contains papers with multilingual content, or adjust the keyword lists.
Encoding Issues
The script reads and writes files as UTF-8. If you encounter encoding errors, make sure your BibTeX file is saved as UTF-8.
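If re-saving the file is not practical, one possible workaround is to decode leniently. The sketch below is an assumption about how that could be done and is not part of the script: the hypothetical decode_bib tries strict UTF-8 first and falls back to replacing undecodable bytes with the Unicode replacement character.

```python
def decode_bib(raw):
    """Hypothetical helper: decode BibTeX bytes, tolerating invalid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Invalid bytes become U+FFFD so the rest of the file stays usable.
        return raw.decode("utf-8", errors="replace")


# Usage, assuming the input file name from this README:
# with open("anthology+abstracts.bib", "rb") as handle:
#     text = decode_bib(handle.read())
```

Replaced characters are lost, so this is best treated as a stopgap; converting the file to UTF-8 remains the cleaner fix.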
Comparison with JavaScript Version
This Python script produces identical results to the JavaScript web application:
- Same filtering logic
- Same LaTeX cleaning
- Same BibTeX output format
- Same keyword detection
The main advantage is that it can be run independently without a web browser and provides detailed statistics about the filtering process.