
Request for Scored Output Files from Algorithm Execution #292

Open
ruixing76 opened this issue Dec 8, 2024 · 11 comments

@ruixing76

ruixing76 commented Dec 8, 2024

Is your feature request related to a problem? Please describe.

I need aggregated results so I can analyze helpfulness at the note level. As far as I can tell, the only way to get them is to run the algorithm from scratch (correct me if I am wrong), so I'm reproducing results using the downloaded data (notes, ratings, note status history, and user enrollment) on a 64-core Intel(R) Xeon(R) Gold 6448H CPU with 500 GB of memory. However, after 20 hours the pre-scoring phase still hasn't completed. It looks like it won't finish within one day, which blocks my further analysis.

Since the algorithm runs every hour or so on the server, may I know:

  1. Would it be possible to share the output files (scored_notes.tsv, helpfulness_scores.tsv, note_status_history.tsv, and aux_note_info.tsv)?
  2. What are the hardware requirements and the expected running time if I want to generate aggregated note scores myself?

This would greatly help with research analysis, as running the algorithm locally to aggregate helpfulness scores has been quite challenging.

Describe the solution you'd like
Would it be possible to share the output files (scored_notes.tsv, helpfulness_scores.tsv, note_status_history.tsv, and aux_note_info.tsv)? They don’t need to be the latest versions—files aligned with the current download page would be fine.

Describe alternatives you've considered
It would be nice to share the hardware requirements and expected running time for generating aggregated note scores from scratch, or any intermediate outputs of the process.

Additional context
Thank you so much for your contributions to this amazing project! I am a PhD student working on fact-checking in Natural Language Processing, and I am very happy to explore and contribute more. I am actively working on this, and any help with the above questions would be much appreciated!
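
For reference, the note-level analysis I have in mind looks roughly like the sketch below. I'm only assuming the outputs are tab-separated tables; no particular column names are assumed.

```python
# Rough sketch of the intended analysis, assuming only that the output
# files are plain tab-separated tables (no specific columns assumed).
import pandas as pd

scored_notes = pd.read_csv("scored_notes.tsv", sep="\t")
helpfulness_scores = pd.read_csv("helpfulness_scores.tsv", sep="\t")

# Inspect what is available before committing to a particular analysis.
print(scored_notes.columns.tolist())
print(scored_notes.head())
print(helpfulness_scores.describe())
```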

@ashilgard

hi - it's not surprising that the job might take that long when run sequentially. since you seem not to be resource-bound, you could try running with the parallel flag set to True. Let us know if that helps!

@ruixing76
Author

> hi - it's not surprising that the job might take that long when run sequentially. since you seem not to be resource-bound, you could try running with the parallel flag set to True. Let us know if that helps!

Hi @ashilgard, many thanks for your reply, I will try that! Actually, I do have resource limitations: normally we don't have that much CPU and memory (64 GB at most), and I queued for a very long time to run the algorithm. It would be great if it were possible to share the results and descriptions of the output formats.

@tuler

tuler commented Jan 9, 2025

I would also appreciate more insights about hardware requirements and expected running time.
I've been trying to run it for weeks, but without success.
My latest attempt used an AWS r5.metal instance, a 3rd-gen Intel Xeon with 768 GB of RAM, but after running for 9 hours the process died with no forensic information.
This is my attempt output: https://gist.github.com/tuler/02aa42c423e5a627a0ea5fa5b9381f7b
I used --parallel

@jbaxter
Collaborator

jbaxter commented Jan 9, 2025

Could anyone external who has successfully run the algorithm code share their machine, runtime, and how many threads/processes they used if different from the default? E.g. @avalanchesiqi I think you may have?

To give a ballpark, it will likely take around 12 hours if run with default multiprocessing settings (of course, highly dependent on the exact processor).

768 GB of RAM is more than we need internally. Could you share any charts of resource usage, @tuler? E.g. RAM and CPU usage over time?

@tuler

tuler commented Jan 9, 2025

> Could you share any charts of resource usage, @tuler? E.g. RAM and CPU usage over time?

I don't have a chart, but I saw memory usage increase during pre-processing until it reached around 180 GB, on a single core at 100% CPU the whole time, for about 5 hours.
Then the models start running in parallel: I see about 8 cores working, and memory drops a lot, to around 30 GB.
Does that make sense?
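
For the next attempt, one way to capture that kind of chart is a small sampler running alongside the scorer. A minimal sketch, assuming psutil is installed; the interval and output path are arbitrary:

```python
# resource_sampler.py: run next to the scorer and stop it with Ctrl-C when
# the run finishes. Requires `pip install psutil`.
import csv
import time

import psutil

INTERVAL_SECONDS = 30  # sampling period

with open("resource_usage.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["elapsed_s", "cpu_percent", "ram_used_gb"])
    start = time.time()
    while True:
        # cpu_percent blocks for the interval and averages usage over it
        cpu = psutil.cpu_percent(interval=INTERVAL_SECONDS)
        ram_gb = psutil.virtual_memory().used / 1e9
        writer.writerow([round(time.time() - start), round(cpu, 1), round(ram_gb, 1)])
        f.flush()
```

Plotting elapsed_s against cpu_percent and ram_used_gb afterwards gives the RAM/CPU-over-time chart asked for above.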

@jbaxter
Collaborator

jbaxter commented Jan 9, 2025

Yeah that seems fine. Not sure why it stopped.

@avalanchesiqi
Contributor

@tuler I checked your log. I think your program had actually finished, but your output folder didn't exist, which is why it stopped at the very end. The error is in your log file, near the end (about one scroll up):

Traceback (most recent call last):
  File "/home/ubuntu/communitynotes/sourcecode/main.py", line 31, in <module>
    main()
  File "/home/ubuntu/communitynotes/sourcecode/scoring/runner.py", line 268, in main
    return _run_scorer(args=args, dataLoader=dataLoader, extraScoringArgs=extraScoringArgs)
  File "/home/ubuntu/communitynotes/sourcecode/scoring/pandas_utils.py", line 678, in _inner
    retVal = main(*args, **kwargs)
  File "/home/ubuntu/communitynotes/sourcecode/scoring/runner.py", line 245, in _run_scorer
    write_tsv_local(scoredNotes, os.path.join(args.outdir, "scored_notes.tsv"))
  File "/home/ubuntu/communitynotes/sourcecode/scoring/process_data.py", line 543, in write_tsv_local
    assert df.to_csv(path, index=False, header=headers, sep="\t") is None
  File "/home/ubuntu/.env/lib/python3.10/site-packages/pandas/util/_decorators.py", line 333, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/.env/lib/python3.10/site-packages/pandas/core/generic.py", line 3967, in to_csv
    return DataFrameRenderer(formatter).to_csv(
  File "/home/ubuntu/.env/lib/python3.10/site-packages/pandas/io/formats/format.py", line 1014, in to_csv
    csv_formatter.save()
  File "/home/ubuntu/.env/lib/python3.10/site-packages/pandas/io/formats/csvs.py", line 251, in save
    with get_handle(
  File "/home/ubuntu/.env/lib/python3.10/site-packages/pandas/io/common.py", line 749, in get_handle
    check_parent_directory(str(handle))
  File "/home/ubuntu/.env/lib/python3.10/site-packages/pandas/io/common.py", line 616, in check_parent_directory
    raise OSError(rf"Cannot save file into a non-existent directory: '{parent}'")
OSError: Cannot save file into a non-existent directory: '../output'
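
A simple guard against this is to create the output directory up front, before kicking off a multi-hour run. A minimal sketch, using the same '../output' path from the traceback:

```python
# Create the output directory up front so the final write_tsv_local calls
# cannot fail on a missing folder after hours of scoring.
from pathlib import Path

Path("../output").mkdir(parents=True, exist_ok=True)
```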

@tuler

tuler commented Jan 15, 2025

Thanks for checking it out, @avalanchesiqi.
It's strange, because on other runs with a data subset the directory did get created, even with intermediate results in it.
I'll try running it again. Thanks.

@presnick

We did get it to complete on the UMich HPC cluster, using 170 GB of memory at peak. I think the maximum it is set up to offer is 184 GB, so we are OK until the memory requirement grows above that.

I think @avalanchesiqi traced the memory issue to some relatively recently added computation involving all pairs of users, or something like that. Can you point to exactly where you traced the issue, Siqi?

Providing outputs from your internal runs would certainly be useful to those of us on the outside, though I understand this may have been a deliberate design decision: make the code and data available, but with a little friction, so that only people who are serious about it reproduce the results.

(BTW: it would also be helpful if your scoring runs recorded the inferred global parameter μ in an output file, rather than just in the logs. We may submit a PR for that.)
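
A sketch of what such a PR could do, assuming the fitted value is available as a plain float at the end of the run; the variable name `globalInterceptMu` and the output file name are placeholders, and the import path assumes the script runs from the sourcecode/ directory. write_tsv_local is the existing helper visible in the traceback above.

```python
# Hypothetical addition at the end of the scoring run: write the fitted
# global intercept out as a TSV next to the other outputs. The names
# `globalInterceptMu` and "global_parameters.tsv" are placeholders, not
# actual identifiers from the repo.
import os

import pandas as pd

from scoring.process_data import write_tsv_local

globalInterceptMu = 0.17  # placeholder; use the value currently only logged

write_tsv_local(
    pd.DataFrame({"globalInterceptMu": [globalInterceptMu]}),
    os.path.join("../output", "global_parameters.tsv"),
)
```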

@tuler

tuler commented Jan 16, 2025

Now I have successfully run it. It took 10.5 hours on an r5.metal AWS instance.

@Jacobsonradical

Jacobsonradical commented Jan 16, 2025

@tuler is a GPU necessary?

Edit: Okay, I suppose the thumbs-down means it's not necessary. Thanks!
