This directory contains the code for FinSeer.
- The datasets used in our paper are uploaded as FinSeer_data.
- Our retriever is uploaded as FinSeer.
- Our fine-tuned stock LLM is uploaded as StockLLM.
## Environment

```bash
# for baseline RAG models and retriever training
pip install InstructorEmbedding
pip install -U FlagEmbedding
pip install sentence-transformers==2.2.2
pip install protobuf==3.20.0
pip install yahoo-finance
python -m pip install -U angle-emb
pip install transformers==4.33.2  # UAE
```
## Train the retriever

### Step 1. Get LLM feedback scores (`src/2_train_retriever/get_llm_feedback_scores.py`)

- `dataset`: the dataset to score, one of `acl18`, `bigdata22`, `stock23`
- `target`: the file to save the results to, a JSON with LLM probability scores
```python
parser = argparse.ArgumentParser(description='test')
parser.add_argument('--dataset', default='acl18', type=str)
parser.add_argument('--target', default='acl18.scored.json', type=str)
args = parser.parse_args()

get_all_scores(llm='StockLLM')  # all LLaMA-family models are supported by this code
```
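The exact scoring logic lives in `get_llm_feedback_scores.py`. As a rough illustration of the idea only (not the repository's implementation), the sketch below treats the feedback score of a candidate as the log-probability StockLLM assigns to the ground-truth movement label when the candidate sequence is shown alongside the query; the model id, prompt wording, and label strings are assumptions.

```python
# Illustrative sketch only, NOT the repo's implementation.
# Assumption: the feedback score of a candidate is the log-probability the LLM
# assigns to the gold movement label ("rise"/"fall") given query + candidate.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "TheFinAI/StockLLM"  # hypothetical model id; any LLaMA-family model works
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

def feedback_score(query_text: str, candidate_text: str, gold_label: str) -> float:
    prompt = f"{candidate_text}\n{query_text}\nAnswer (rise or fall):"
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids          # [1, P]
    label_ids = tokenizer(" " + gold_label, add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, torch.tensor([label_ids])], dim=1)  # [1, P+L]
    with torch.no_grad():
        logits = model(input_ids).logits                                   # [1, P+L, V]
    # Rows P-1 .. P+L-2 of the logits predict the L label tokens.
    log_probs = torch.log_softmax(logits[0, prompt_ids.shape[1] - 1 : -1], dim=-1)
    return sum(log_probs[i, tok].item() for i, tok in enumerate(label_ids))
```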
### Step 2. Select positive and negative candidates (`src/2_train_retriever/select_positive_and_combine.py`)

Before this step, you should already have generated `acl18.scored.json`, `bigdata22.scored.json`, and `stock23.scored.json`.
This step selects candidates for all three datasets and generates a combined `train.scored.json` file (a rough sketch of one possible selection scheme follows below).
Then, following the steps in this link, you can fine-tune your own FinSeer on the `train.scored.json` data.
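For reference, one simple selection scheme is to take the highest-scored candidate of each query as the positive and the lowest-scored ones as negatives. The sketch below assumes each entry of a `*.scored.json` file holds a query plus its candidates with the step-1 scores; the field names are assumptions, not the repo's actual schema.

```python
# Illustrative sketch only; field names ('query', 'candidates', 'text', 'score')
# are assumptions about the *.scored.json schema, not the repo's actual format.
import json

def select_and_combine(scored_files, out_path="train.scored.json", n_neg=15):
    train = []
    for path in scored_files:
        with open(path) as f:
            entries = json.load(f)
        for entry in entries:
            ranked = sorted(entry["candidates"], key=lambda c: c["score"], reverse=True)
            train.append({
                "query": entry["query"],
                "pos": [ranked[0]["text"]],                    # highest-scored candidate
                "neg": [c["text"] for c in ranked[-n_neg:]],   # lowest-scored candidates
            })
    with open(out_path, "w") as f:
        json.dump(train, f)

select_and_combine(["acl18.scored.json", "bigdata22.scored.json", "stock23.scored.json"])
```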
## Stock movement prediction

### Step 1. Get embeddings of queries and candidates (`src/3_stock_movement_prediction/get_embeddings.py`)

- `test_dataset`: the test dataset, e.g. `bigdata22`
- `embedding_model`: the retriever to embed with, one of `instructor`, `uae`, `bge`, `llm_embedder`, `e5`, `FinSeer`
- `q_or_c`: `query` or `candidate`; the embeddings of query sequences and candidate sequences are generated in separate runs
```python
parser = argparse.ArgumentParser(description='test')
parser.add_argument('--test_dataset', default='bigdata22', type=str)
parser.add_argument('--embedding_model', default='e5',
                    choices=['instructor', 'uae', 'bge', 'llm_embedder', 'e5', 'FinSeer'])
parser.add_argument('--q_or_c', default='candidate')
args = parser.parse_args()
```
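For the `e5` baseline, for instance, this step boils down to something like the sketch below; the model id, the `query:`/`passage:` prefixes, and the serialised sequence texts are assumptions for illustration, not the repo's exact inputs.

```python
# Minimal sketch for the 'e5' baseline; the model id and the serialised sequence
# texts are assumptions, not the repo's exact inputs.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-base-v2")

# Queries and candidates are embedded in separate runs (--q_or_c query / candidate);
# E5-style models expect "query: " / "passage: " prefixes.
query_texts = ["query: AAPL daily price movements over the last 5 trading days: ..."]
candidate_texts = ["passage: MSFT adjusted close prices over the same window: ..."]

query_emb = model.encode(query_texts, normalize_embeddings=True)
candidate_emb = model.encode(candidate_texts, normalize_embeddings=True)
```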
### Step 2. Calculate the similarity between the query and the qualified candidates (and keep the top-5 related candidates)
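With the normalised embeddings from step 1, this reduces to a dot-product search. A minimal sketch (the random vectors merely stand in for the step-1 outputs):

```python
# Cosine similarity between a query embedding and all qualified candidate
# embeddings, keeping the top-5.
import numpy as np

def top_k_candidates(query_emb, candidate_embs, k=5):
    # With L2-normalised embeddings, the dot product equals cosine similarity.
    sims = candidate_embs @ query_emb
    top_idx = np.argsort(-sims)[:k]
    return top_idx, sims[top_idx]

# Random embeddings stand in for the outputs of step 1.
rng = np.random.default_rng(0)
q = rng.normal(size=768); q /= np.linalg.norm(q)
c = rng.normal(size=(100, 768)); c /= np.linalg.norm(c, axis=1, keepdims=True)
idx, scores = top_k_candidates(q, c)
```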
### Step 3. Predict stock movement with or without retrieval

This is what our three files do (a conceptual sketch follows the list):
- `1_no_retrieval.py`
- `2_random_retrieval.py`
- `3_similarity_retrieval.py`
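Conceptually, the scripts differ only in what (if anything) is placed in the prompt before the query: nothing, randomly sampled candidates, or the top-5 most similar candidates from step 2. The prompt wording below is an assumption, not the repo's template.

```python
# Illustrative only: with retrieval, the retrieved candidate sequences are
# prepended to the query before asking StockLLM for the movement; without
# retrieval, the prompt contains the query alone. Wording is an assumption.
def build_prompt(query_text, retrieved_texts=None):
    context = ""
    if retrieved_texts:
        context = "Reference sequences:\n" + "\n".join(retrieved_texts) + "\n\n"
    return (context
            + "Predict the next movement (rise or fall) for the following stock:\n"
            + query_text)

print(build_prompt("AAPL daily price movements over the last 5 trading days: ..."))
```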